During the 2019 Virtual Summit, Ulta Beauty’s Rohit Mishra and Robert Murphy presented the company’s approach to the system changes made to resolve ATG application memory issues. Ulta made several internal changes to tighten testing cycles in preparation for the 2018 holiday season.
At Ulta, holiday preparedness is of utmost importance. Holiday Readiness Testing occurs with the help of all application teams. They test the frontend from the cloud and the backend with Micro Focus LoadRunner. Then they analyze the results and make sure the hardware and network capacity are adequate. They also examine the integrated whole for best results.
There are two test profiles. The first is focused on peak web traffic on Thanksgiving. The second analyzes peak shipment traffic on Cyber Monday. As all retailers know, the holidays are a very busy time. Ulta tries to simulate anticipated shopping in order to be prepared. After the holiday ends, Ulta immediately begins planning for the following year’s holiday traffic. They build a new workload model to match up with expected volume. Throughout the year, the team gradually integrates new functionality that becomes available into the existing test suite. This cross-functional effort is led by the performance team.
Application Tuning at Ulta Beauty
In 2017, Ulta faced major issues with capacity. While it appeared that they had the capacity to handle the load, their JVMs did not perform well, and garbage collection wasn’t as effective as desired. Essentially, under a load sustained for four to five hours, memory consumption would build steadily on top of itself and never be reclaimed. Identifying the cause of this performance-hindering issue became a top priority for 2018.
As a lesson learned from the 2017 holiday season, Ulta introduced a more robust performance and load testing cycle focused on running long-duration soak tests. Proactive collection of Java memory dumps and garbage collection logs from these tests uncovered both the major contributors to JVM heap consumption and the sources of gradual memory leaks in the application. Two key contributors (carried over from 2017) significantly increased the Java memory footprint and negatively impacted application throughput. The culprits were a workspace history maintained within the Endeca XM content that was pushed to the ATG servers, and proxy objects created by the ATG Multisite framework that lived on even after the user session had expired.
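The presentation did not cover tooling specifics, but as a sketch of what proactive dump collection can look like, the Java snippet below uses the JDK’s standard HotSpotDiagnosticMXBean to capture a heap dump mid-soak-test. The output path and class name are hypothetical, and the same data can be gathered externally with jmap or with GC-logging JVM flags.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class SoakTestHeapDump {
    // Capture a heap dump of live objects only; live=true forces a full GC
    // first, so the dump contains only what survived collection.
    public static void dump(String path) throws Exception {
        HotSpotDiagnosticMXBean mxBean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        mxBean.dumpHeap(path, true); // fails if the file already exists
    }

    public static void main(String[] args) throws Exception {
        // Timestamped file name keeps periodic dumps from colliding.
        dump("/tmp/soak-" + System.currentTimeMillis() + ".hprof");
    }
}
```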
Removing Workspace History
One of the identified issues impacting memory consumption was a result of five years of use of Endeca Experience Manager (EEM), which Ulta used for creating and managing the user experience and site content. Over time, Ulta’s usage of EEM increased significantly as most of the browse and navigation experience moved to Experience Manager. The out-of-the-box content zip captured the entire history of workspaces created in EEM for editing and authoring content, and this large zip file was then promoted and loaded onto the page-serving JVMs through the Assembler API. Under load, JVM memory reached the configured threshold and triggered garbage collection, but the reclaimed memory was significantly less than expected. Consequently, major garbage collection cycles ran frequently, which reduced application throughput; even forcing a full GC post-test (after letting residual user sessions expire) did not reclaim enough heap. These symptoms led to a heap analysis, which uncovered EEM content sitting immortally in Java memory. Roughly 95 percent of the heap occupied by that content was contributed by workspace history.
The following image shows a snapshot of the objects under investigation. The first field shows the GC size of the XML that was being pushed to the live servers.
The Solution
Based on these results, Ulta customized the content promotion process: custom scripts removed the workspace history before the content was promoted to the application instances. Subsequent load tests showed much-improved GC throughput. They could then allocate more sessions without triggering too many major or full GC cycles. Memory consumption on an idle JVM dropped from 4.5 GB to 2.5 GB, against a threshold that triggered a CMS GC at 6 GB heap occupancy.
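The custom promotion scripts themselves were not shown; the snippet below is a minimal sketch of the idea in Java, assuming (hypothetically) that workspace-history entries live under a workspaces/ prefix inside the content zip. The real layout of the Endeca promotion artifact may differ.

```java
import java.io.*;
import java.util.zip.*;

public class StripWorkspaceHistory {
    // Hypothetical path prefix for workspace-history entries; the actual
    // structure of the content zip may differ.
    private static final String HISTORY_PREFIX = "workspaces/";

    public static void strip(File in, File out) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(new FileInputStream(in));
             ZipOutputStream zout = new ZipOutputStream(new FileOutputStream(out))) {
            byte[] buf = new byte[8192];
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.getName().startsWith(HISTORY_PREFIX)) {
                    continue; // drop workspace history before promotion
                }
                zout.putNextEntry(new ZipEntry(entry.getName()));
                int n;
                while ((n = zin.read(buf)) > 0) {
                    zout.write(buf, 0, n);
                }
                zout.closeEntry();
            }
        }
    }
}
```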
Removing Proxy Objects
After cleaning up the Endeca workspaces, the application’s memory footprint improved significantly. However, as they started to push more load, the applications exhibited similar behavior, spinning off inefficient garbage collection cycles that each reclaimed less heap than expected. Ulta collected Java memory dumps to further analyze the top consumers of the heap and the reason they remained immortal inside it. Investigations revealed that the ATG Multisite framework inherently creates proxy objects for components declared as shareable between two logical sites. These proxies became tied to the JVM root even though ATG discarded its references to them when a session expired.
Due to this direct affinity with the JVM root, these objects were held in memory and never reclaimed. When the application was put under sustained load, the efficiency of GC cycles kept diminishing over time, to the point that each major garbage collection reclaimed only a few MBs.
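This is the classic pattern of a leak anchored at a GC root. The sketch below is illustrative Java, not ATG source: a static registry is itself a GC root, so any per-session proxy placed in it stays reachable forever unless something evicts it when the session expires.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ProxyRegistry {
    // A static field is a GC root: every proxy stored here is immortal
    // unless it is explicitly removed.
    private static final Map<String, Object> PROXIES = new ConcurrentHashMap<>();

    public static Object proxyFor(String sessionId) {
        // Creates and retains one proxy per session; the map only grows.
        return PROXIES.computeIfAbsent(sessionId, id -> new Object());
    }

    // Without an eviction hook like this on session expiry, the entries
    // accumulate for the life of the JVM.
    public static void onSessionExpired(String sessionId) {
        PROXIES.remove(sessionId);
    }
}
```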
The image below shows more than 534 proxy objects accumulated over forty minutes in a single JVM during an enrollment stress test.
Since Ulta does not use, and does not intend to use, Multisite features, they decided to eliminate the shareable-type components defined in the SiteGroupManager component entirely. They validated the site functionally, then ran normal and holiday peak loads, followed by sustained load tests (soak tests). Their memory footprint dropped further as a result: an idle JVM after a full GC held just 1.5 GB. During load execution, GC throughput jumped to 99.9 percent. During peak load, Ulta’s applications ran one major GC per hour, and even fewer once the load normalized. Garbage collection patterns were close to ideal for the load relative to the application capacity they had.
Revisiting Caching Strategies
Ulta revisited caching strategies, as they do every year, adjusting caches both inside and outside the application. Some APIs and services were not designed in a way that could be easily cached, so those services were redesigned to make them cacheable at the edge.
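The redesigned services were not detailed, but as a hedged sketch of what edge cacheability usually requires, the hypothetical servlet below keys its response only on request parameters (no per-user session state) and sets explicit Cache-Control headers that a CDN can honor; the class name and TTL are illustrative.

```java
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

public class CacheableCatalogServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // Keying only on the product id (never the session) keeps the
        // response identical for every user, which is what allows the
        // edge cache to serve it.
        String productId = req.getParameter("productId");
        resp.setHeader("Cache-Control", "public, max-age=300"); // 5-minute edge TTL
        resp.setContentType("application/json");
        resp.getWriter().write("{\"productId\":\"" + productId + "\"}");
    }
}
```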
Ulta also keeps a close eye on the ATG application repository caches and their associated hits and misses. As they introduce new data sources and build new repositories on them, they consider how to size the caches for each new repository. This approach is part of their basic preparedness for an application-level health check.
ATG Database Optimization
Archival & Partitioning
At Oracle’s suggestion, Ulta added additional Oracle nodes to increase database capacity. One of the top database administrator (DBA) concerns was the growing size of data negatively impacting query performance. Further, the business tendency to treat ATG’s transactional database as a source for running near real-time reports was often expensive. Though most reports were generated from an offline database replicated every 15 minutes, 15 minutes during the holidays was too long for the business to wait for visibility into inventory positions and sales/demand data.
Ulta has an internal policy on data retention and archival. After discussing with business partners, DevOps and the DBAs decided to de-normalize transactional data older than three years by capturing header-level information and flattening it into a single master table. The transactional tables kept data three years old and younger; the rest was archived into the de-normalized master table and then removed. This struck a middle ground between performance concerns and enterprise policies on data retention, saved Ulta a significant amount in storage cost, and improved overall query performance by running reports against a smaller transactional database.
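As a rough sketch of that archive-then-purge flow, the JDBC snippet below copies header-level fields of orders older than 36 months into a flattened master table and then deletes them from the transactional table, all in one transaction. The table and column names are hypothetical stand-ins for the actual schema.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class OrderArchiver {
    // Hypothetical table/column names; the real ATG schema (e.g. DCSPP_ORDER)
    // and the flattened master table will differ.
    private static final String COPY_SQL =
        "INSERT INTO ORDER_ARCHIVE_MASTER (ORDER_ID, SUBMITTED_DATE, TOTAL) " +
        "SELECT ORDER_ID, SUBMITTED_DATE, TOTAL FROM ORDERS " +
        "WHERE SUBMITTED_DATE < ADD_MONTHS(SYSDATE, -36)";
    private static final String DELETE_SQL =
        "DELETE FROM ORDERS WHERE SUBMITTED_DATE < ADD_MONTHS(SYSDATE, -36)";

    public static void archive(Connection conn) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(COPY_SQL);   // flatten header-level data into the master table
            stmt.executeUpdate(DELETE_SQL); // purge archived rows from the transactional table
            conn.commit();
        } catch (SQLException e) {
            conn.rollback(); // keep copy and delete atomic
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```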
Removing Index Contentions
While preparing for the 2017 holiday, Ulta saw repeated index contention on the LAST_MODIFIED_DATE column of the DCSPP_ORDER table; the ATG OOTB-supplied DDL for DCSPP_ORDER creates an index on LAST_MODIFIED_DATE. The issue arose when concurrency passed a certain threshold and a large number of threads were manipulating and updating the shopping cart, intensifying update activity on this column. The ATG platform inherently updates the LAST_MODIFIED_DATE timestamp each time UpdateOrder is called. Because the value is a timestamp and the column is indexed, contention arose at the index itself as the Oracle RDBMS struggled to keep up with an extremely high rate of updates to the column, where the value also changed on every attempt (down to the millisecond).
Oracle has a knowledge article on the My Oracle Support portal that suggests dropping the index when faced with this contention. Dropping it, however, would have broken Ulta’s custom CSC indexing process. They ultimately customized the code to delay subsequent updates to this column (for incomplete carts) and to record the timestamp only to minute precision, ignoring the second and millisecond fractions. Dropping those fractions from LAST_MODIFIED_DATE updates let the indexed column’s value stay unchanged for longer, reducing the overhead on the Oracle RDBMS of managing rapid index changes each time the column is updated.
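The snippet below is a minimal sketch of that truncation logic in plain Java (not ATG source, and the helper names are hypothetical): values are written at minute precision, and a write is skipped entirely when the stored value is already current.

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

public class LastModifiedPolicy {
    // The value to store: the current time with seconds and milliseconds
    // dropped, so repeated cart updates within the same minute write an
    // identical value and cause no index churn.
    public static Timestamp minutePrecision(LocalDateTime now) {
        return Timestamp.valueOf(now.truncatedTo(ChronoUnit.MINUTES));
    }

    // Skip the UPDATE entirely when the stored value is already current,
    // delaying subsequent writes to the indexed column.
    public static boolean needsUpdate(Timestamp stored, LocalDateTime now) {
        return stored == null || !stored.equals(minutePrecision(now));
    }
}
```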