New Technology Emerges to Archive Web Pages

The lingering stone age methods used to archive web content for legal and compliance purposes hit Steve Marsh like a brick after a recent survey conducted by his email archiving firm.

"When we asked about web content, we found that most firms are still doing manual archiving and reviewing. Most were still printing and storing web pages. If compliance [departments] want to do a review of web documents, they will print out or fax a page and file it away, or later print them and turn them into a PDF," says Marsh, CEO and founder of Smarsh (SMASP), which just introduced a new web archiving tool the firm hopes will draw banks out of the caves.

Using technology it recently acquired from a web archiving startup called Perpetually, Smarsh has built a web archiving system that works almost like a time machine: it captures the original source code for web pages, which lets people view a page as it appeared at a single point in time. "You can see the web page in its working form. You aren't looking at a screen shot," Marsh says. "So when a compliance officer or a legal person or another exec want to view sites, they can see how the site looked and felt at the time."

Smarsh is competing with tech firms such as Symantec (SYMC) and HP (HPQ) to build web archiving tools that accommodate more complex web content, such as video attachments, web links and social media. This content is increasingly included in communications between financial firms and their business partners and clients. It is also subject to regulatory scrutiny, and it is harder to archive and deliver to third parties using traditional screen shots and text capture.

"Smarsh's new product allows it to archive static and dynamic web content," says Brian Hill, a principal analyst at Forrester Research. Hill says the expansion of social media use by banks has placed digital archiving at a tipping point, opening up the market for hosted services that offer advanced techniques to capture live web content. It's a nascent market that should pick up quickly given the intersection of tech development and regulatory mandates. "Smarsh is getting into this relatively early, but it's a good move for them," Hill says.

The archiving rules come from U.S. agencies such as FINRA and the SEC and from international regulatory bodies such as the UK's Financial Services Authority. The regulations are designed primarily to oversee marketing messages and any content that can be construed as marketing financial products, investments or services, and they call for electronic content to be stored for potential compliance audits and subpoenas.

"Electronically stored information is subject to discovery…the focus on upgrading technology to archive other types of content, such as file shares, is a natural progression to the focus on having that information available," says Hill.

Smarsh's web archiving captures a web site's individual pages, and the content of those pages, in the original format. That provides a record of what was published online at any point in time. Archived web pages are rendered with the original design and experience. The interactive elements are still functional and the links between pages are preserved. That includes full websites, blogs, wikis, RSS feeds, audio and video files, as well as interactive elements such as YouTube videos, slideshows, JavaScript and Flash content. Each file is time-stamped and stored in Smarsh's data centers. Smarsh says the data centers are geographically diverse and SAS 70 Type II audited.
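To make the idea of a time-stamped, point-in-time record concrete, here is a minimal sketch in Python. It is not Smarsh's implementation, which the company has not described at this level. The names Snapshot, PageArchive, capture and as_of are hypothetical, and the sketch assumes the page source has already been downloaded.

```python
import hashlib
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    """One archived copy of a page (hypothetical record layout)."""
    url: str
    captured_at: datetime
    source: bytes   # the page's original source, stored unmodified
    sha256: str     # fingerprint of the stored bytes


class PageArchive:
    """In-memory stand-in for an archive; a real system would use durable storage."""

    def __init__(self):
        self._by_url = {}  # url -> list of Snapshot, kept sorted by capture time

    def capture(self, url, source, captured_at=None):
        """Store the source with a timestamp (defaults to the current UTC time)."""
        captured_at = captured_at or datetime.now(timezone.utc)
        snap = Snapshot(url, captured_at, source, hashlib.sha256(source).hexdigest())
        snaps = self._by_url.setdefault(url, [])
        snaps.append(snap)
        snaps.sort(key=lambda s: s.captured_at)
        return snap

    def as_of(self, url, moment):
        """Return the latest snapshot taken at or before `moment`, or None."""
        snaps = self._by_url.get(url, [])
        times = [s.captured_at for s in snaps]
        i = bisect_right(times, moment)
        return snaps[i - 1] if i else None
```

A reviewer asking how a page looked in mid-May would call as_of with that date and get back the most recent capture made on or before it.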

"You can use it to do research and monitor web pages, and it can alert you when a change has been made," says Marsh, who said there were four clients active at the time of launch in late May, with about a dozen others in contract, mostly financial firms such as banks, broker-dealers and financial advisors. Smarsh has about 15,000 clients for its email archiving product across industries.
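The article does not say how the change alerts work. One common approach, shown here purely as an assumption, is to compare a fingerprint of the newly fetched source with the fingerprint saved from the previous capture. The function names fingerprint and detect_change are hypothetical.

```python
import hashlib


def fingerprint(source: bytes) -> str:
    """Return a SHA-256 hex digest of the page source."""
    return hashlib.sha256(source).hexdigest()


def detect_change(previous_fingerprint, current_source):
    """Compare the current source with the last known fingerprint.

    Returns (changed, current_fingerprint). A first capture, where
    previous_fingerprint is None, is not reported as a change.
    """
    current = fingerprint(current_source)
    changed = previous_fingerprint is not None and current != previous_fingerprint
    return changed, current
```

A byte-level comparison like this flags any difference at all, including rotating ads or embedded timestamps, so a production monitor would likely need a way to ignore such noise.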

The archived content can be downloaded to a PC, encrypted and either saved or imported directly to third-party legal review platforms. The firm says its technology differs from screen scraping because it places all of the components and coding that underpin a web site into the archive, rather than grabbing the site's data or textual content off the screen. Smarsh says retrieving all of the components of a web site allows the archived content to be richer and more interactive.
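The distinction between scraping text and archiving components can be illustrated with a short Python sketch built on the standard library's HTML parser. It is an illustration only, not Smarsh's method. The class name ComponentCollector and its tag-to-attribute table are assumptions. The text list is what a scraper would keep, while the resources list holds the stylesheets, scripts and media files that a component-level archive would also need to fetch and store.

```python
from html.parser import HTMLParser


class ComponentCollector(HTMLParser):
    """Separate a page's visible text from the files it depends on."""

    # Tag -> attribute that points at an external component (illustrative, not exhaustive).
    RESOURCE_ATTRS = {
        "script": "src", "img": "src", "link": "href", "iframe": "src",
        "video": "src", "audio": "src", "source": "src", "embed": "src",
    }

    def __init__(self):
        super().__init__()
        self.resources = []  # what a component-level archive would also fetch
        self.text = []       # roughly what a text scraper would keep

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        if wanted:
            for name, value in attrs:
                if name == wanted and value:
                    self.resources.append(value)

    def handle_data(self, data):
        # Note: inline <script> or <style> bodies also arrive here in this sketch.
        if data.strip():
            self.text.append(data.strip())
```

Feeding a page's source to the collector with feed() fills both lists, which makes it easy to see how much of a modern page lives outside its visible text.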

Other providers of web archiving include HP, which ramped up its digital archiving through its acquisition of enterprise information management provider Autonomy; Proofpoint, which offers a hosted enterprise archiving product; and Symantec, which recently purchased data archiving and storage vendor LiveOffice to enhance its ability to archive email, instant messaging and file sharing. (Other companies, such as TeaLeaf, similarly archive web pages in such a way that they can be retrieved as they were at a point in time for the purpose of fixing problems and improving the customer experience.)

David Scott, a product manager at Symantec, says archiving web content in the social media age is further complicated by the mix of communication tools that a bank's staff may use to share digital content, which requires dynamic archiving and search capabilities.

"It's evolving. We are seeing scenarios such as a customer posting to a corporate Facebook page to complain about customer service. But the bank typically won't respond on the Facebook page in all cases to such a posting. They may respond in a more private channel, then go back to Facebook to post in the more public forum that everything has been resolved with that customer. So from an archiving and discovery perspective, you need to be able to see all communication vehicles that are connected to that action," Scott says.

Scott also says archiving needs change frequently as staff adopt different web communication and sharing tools on their own. He says one of Symantec's financial clients is archiving more than 20 different sources of content, mostly involving social media and instant messaging. "They are adding somewhere in the neighborhood of two new sources every month due to the demand from various departments to use different tools."
