Netflix open sources its data traffic cop, Suro

Netflix is open sourcing a tool called Suro that the company uses to direct data from its source to its destination in real time. More than just serving a key role in the Netflix data pipeline, though, it’s also a great example of the impressive — if sometimes redundant — ecosystem of open source data-analysis tools making their way out of large web properties.

Netflix’s various applications generate tens of billions of events per day, and Suro collects them all before sending them on their way. Most head to Hadoop (via Amazon S3) for batch processing, while others head to Druid and ElasticSearch (via Apache Kafka) for real-time analysis. According to the Netflix blog post explaining Suro (which goes into much more depth), the company is also looking at how it might use real-time processing engines such as Storm or Samza to perform machine learning on event data.

An example Suro workflow. Source: Netflix

To anyone familiar with the big data space, the names of the various technologies at play here are probably associated to some degree with the companies that developed them. Netflix created Suro, LinkedIn created Kafka and Samza, Twitter (actually, Backtype, which Twitter acquired) created Storm, and Metamarkets (see disclosure) created Druid. Suro, the blog post’s authors acknowledged, is based on the Apache Chukwa project and is similar to Apache Flume (created by Hadoop vendor Cloudera) and Facebook’s Scribe. Hadoop, of course, was all but created at Yahoo and has since seen notable ecosystem contributions from all sorts of web companies.

I sometimes wonder whether all these companies really need to be creating their own technologies all the time, or if they could often get by using the stuff their peers have already created. Like most things in life, though, the answer to that question is probably best decided on a case-by-case basis. Storm, for example, is becoming a very popular tool for stream processing, but LinkedIn felt like it needed something different and thus built Samza. Netflix decided it needed Suro rather than just using pre-existing technologies (largely because its cloud-heavy infrastructure runs in Amazon Web Services), but it also clearly uses lots of tools built elsewhere (including the Apache Cassandra database).

A diagram of LinkedIn’s data architecture as of February 2013, including everything from Kafka to Teradata.

Hopefully, the big winners in all this innovation will be mainstream technology users that don’t have the talent (or, necessarily, the need) to develop these advanced systems in-house but could benefit from their capabilities. We’re already seeing Hadoop vendors, for example, trying to make projects such as Storm and the Spark processing framework usable by their enterprise customers, and it seems unlikely they’ll be the last. There are a whole lot of AWS users, after all, and they might want the capabilities Suro can provide without having to rely on Amazon to deliver them. (We’ll likely hear a lot more about the future of Hadoop as a data platform at our Structure Data conference, which takes place in March.)

(source: Gigaom.com)

 

Next-generation search and analytics with Apache Lucene and Solr 4

Use search engine technology to build fast, efficient, and scalable data-driven applications

Apache Lucene™ and Solr™ are highly capable open source search technologies that make it easy for organizations to enhance data access dramatically. With the 4.x line of Lucene and Solr, it’s easier than ever to add scalable search capabilities to your data-driven applications. Lucene and Solr committer Grant Ingersoll walks you through the latest Lucene and Solr features that relate to relevance, distributed search, and faceting. Learn how to leverage these capabilities to build fast, efficient, and scalable next-generation data-driven applications.

I began writing about Solr and Lucene for developerWorks six years ago (see Resources). Over those years, Lucene and Solr established themselves as rock-solid technologies (Lucene as a foundation for Java™ APIs, and Solr as a search service). For instance, they power search-based applications for Apple iTunes, Netflix, Wikipedia, and a host of others, and they help to enable the IBM Watson question-answering system.

Over the years, most people’s use of Lucene and Solr focused primarily on text-based search. Meanwhile, the new and interesting trend of big data emerged along with a (re)new(ed) focus on distributed computation and large-scale analytics. Big data often also demands real-time, large-scale information access. In light of this shift, the Lucene and Solr communities found themselves at a crossroads: the core underpinnings of Lucene began to show their age under the stressors of big data applications such as indexing all of the Twittersphere (see Resources). Furthermore, Solr’s lack of native distributed indexing support made it increasingly hard for IT organizations to scale their search infrastructure cost-effectively.

The community set to work overhauling the Lucene and Solr underpinnings (and in some cases the public APIs). Our focus shifted to enabling easy scalability, near-real-time indexing and search, and many NoSQL features — all while leveraging the core engine capabilities. This overhaul culminated in the Apache Lucene and Solr 4.x releases. These versions aim directly at solving next-generation, large-scale, data-driven access and analytics problems.

This article walks you through the 4.x highlights and shows you some code examples. First, though, you’ll go hands-on with a working application that demonstrates the concept of leveraging a search engine to go beyond search. To get the most from the article, you should be familiar with the basics of Solr and Lucene, especially Solr requests. If you’re not, see Resources for links that will get you started with Solr and Lucene.

Quick start: Search and analytics in action

Search engines are only for searching text, right? Wrong! At their heart, search engines are all about quickly and efficiently filtering and then ranking data according to some notion of similarity (a notion that’s flexibly defined in Lucene and Solr). Search engines also deal effectively with both sparse data and ambiguous data, which are hallmarks of modern data applications. Lucene and Solr are capable of crunching numbers, answering complex geospatial questions (as you’ll see shortly), and much more. These capabilities blur the line between search applications and traditional database applications (and even NoSQL applications).

For example, Lucene and Solr now:

  • Support several types of joins and grouping options
  • Have optional column-oriented storage
  • Provide several ways to deal with text and with enumerated and numerical data types
  • Enable you to define your own complex data types and storage, ranking, and analytics functions

A search engine isn’t a silver bullet for all data problems. But the fact that text search was the primary use of Lucene and Solr in the past shouldn’t prevent you from using them to solve your data needs now or in the future. I encourage you to think about using search engines in ways that go well outside the proverbial (search) box.

To demonstrate how a search engine can go beyond search, the rest of this section shows you an application that ingests aviation-related data into Solr. The application queries the data — most of which isn’t textual — and processes it with the D3 JavaScript library (see Resources) before displaying it. The data sets are from the Research and Innovative Technology Administration (RITA) of the U.S. Department of Transportation’s Bureau of Transportation Statistics and from OpenFlights. The data includes details such as originating airport, destination airport, time delays, causes of delays, and airline information for all flights in a particular time period. By using the application to query this data, you can analyze delays between particular airports, traffic growth at specific airports, and much more.

Start by getting the application up and running, and then look at some of its interfaces. Keep in mind as you go along that the application interacts with the data by interrogating Solr in various ways.

Setup

To get started, you need the following prerequisites:

  • Lucene and Solr.
  • Java 6 or higher.
  • A modern web browser. (I tested on Google Chrome and Firefox.)
  • 4GB of disk space — less if you don’t want to use all of the flight data.
  • Terminal access with a bash (or similar) shell on *nix. For Windows, you need Cygwin. I only tested on OS X with the bash shell.
  • wget if you choose to download the data by using the download script that’s in the sample code package. You can also download the flight data manually.
  • Apache Ant 1.8+ for compilation and packaging purposes, if you want to run any of the Java code examples.

See Resources for links to the Lucene, Solr, wget, and Ant download sites.

With the prerequisites in place, follow these steps to get the application up and running:

  1. Download this article’s sample code ZIP file and unzip it to a directory of your choice. I’ll refer to this directory as $SOLR_AIR.
  2. At the command line, change to the $SOLR_AIR directory:
    cd $SOLR_AIR
  3. Start Solr:
    ./bin/start-solr.sh
  4. Run the script that creates the necessary fields to model the data:
    ./bin/setup.sh
  5. Point your browser at http://localhost:8983/solr/#/ to display the new Solr Admin UI. Figure 1 shows an example:
    Figure 1. Solr UI

    Screen capture of the new Solr UI

  6. At the terminal, view the contents of the bin/download-data.sh script for details on what to download from RITA and OpenFlights. Download the data sets either manually or by running the script:
    ./bin/download-data.sh

    The download might take significant time, depending on your bandwidth.

  7. After the download is complete, index some or all of the data.

    To index all data:

    bin/index.sh

    To index data from a single year, use any value between 1987 and 2008 for the year. For example:

    bin/index.sh 1987
  8. After indexing is complete (which might take significant time, depending on your machine), point your browser at http://localhost:8983/solr/collection1/travel. You’ll see a UI similar to the one in Figure 2:
    Figure 2. The Solr Air UI

    Screen capture of an example Solr AIR screen

Exploring the data

With the Solr Air application up and running, you can look through the data and the UI to get a sense of the kinds of questions you can ask. In the browser, you should see two main interface points: the map and the search box. For the map, I started with D3’s excellent Airport example (see Resources). I modified and extended the code to load all of the airport information directly from Solr instead of from the example CSV file that comes with the D3 example. I also did some initial statistical calculations about each airport, which you can see by mousing over a particular airport.

I’ll use the search box to showcase a few key pieces of functionality that help you build sophisticated search and analytics applications. To follow along in the code, see the solr/collection1/conf/velocity/map.vm file.

The key focus areas are:

  • Pivot facets
  • Statistical functionality
  • Grouping
  • Lucene and Solr’s expanded geospatial support

Each of these areas helps you get answers such as the average delay of arriving airplanes at a specific airport, or the most common delay times for an aircraft that’s flying between two airports (per airline, or between a certain starting airport and all of the nearby airports). The application uses Solr’s statistical functionality, combined with Solr’s longstanding faceting capabilities, to draw the initial map of airport “dots” — and to generate basic information such as total flights and average, minimum, and maximum delay times. (This capability alone is a fantastic way to find bad data or at least extreme outliers.) To demonstrate these areas (and to show how easy it is to integrate Solr with D3), I’ve implemented a bit of lightweight JavaScript code that:

  1. Parses the query. (A production-quality application would likely do most of the query parsing on the server side or even as a Solr query-parser plugin.)
  2. Creates various Solr requests.
  3. Displays the results.

The request types are:

  • Lookup per three-letter airport code, such as RDU or SFO.
  • Lookup per route, such as SFO TO ATL or RDU TO ATL. (Multiple hops are not supported.)
  • Clicking the search button when the search box is empty to show various statistics for all flights.
  • Finding nearby airports by using the near operator, as in near:SFO or near:SFO TO ATL.
  • Finding likely delays at various distances of travel (less than 500 miles, 500 to 1000, 1000 to 2000, 2000 and beyond), as in likely:SFO.
  • Any arbitrary Solr query to feed to Solr’s /travel request handler, such as &q=AirportCity:Francisco.

The first three request types in the preceding list are all variations of the same type. These variants highlight Solr’s pivot faceting capabilities to show, for instance, the most common arrival delay times per route (such as SFO TO ATL) per airline per flight number. The near option leverages the new Lucene and Solr spatial capabilities to perform significantly enhanced spatial calculations such as complex polygon intersections. The likely option showcases Solr’s grouping capabilities to show airports at a range of distances from an originating airport that had arrival delays of more than 30 minutes. All of these request types augment the map with display information through a small amount of D3 JavaScript. For the last request type in the list, I simply return the associated JSON. This request type enables you to explore the data on your own. If you use this request type in your own applications, you naturally would want to use the response in an application-specific way.
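
Behind the scenes, the statistics that power the airport dots come from Solr’s StatsComponent. A request along the following lines (a sketch using the Solr Air field names; the exact parameters the application sends may differ) returns the numbers for every originating airport in one call:

/solr/collection1/travel?&wt=json&q=*:*&rows=0&stats=true&stats.field=ArrDelay&stats.facet=Origin

The stats section of the response includes the count, min, max, and mean of ArrDelay for each Origin value, which is enough to label each airport dot on the map.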

Now try out some queries on your own. For instance, if you search for SFO TO ATL, you should see results similar to those in Figure 3:

Figure 3. Example SFO TO ATL screen

Screen capture from Solr Air showing SFO TO ATL results

In Figure 3, the two airports are highlighted in the map that’s on the left. The Route Stats list on the right shows the most common arrival delay times per flight per airline. (I loaded the data for 1987 only.) For instance, it tells you that Delta flight 156 arrived in Atlanta five minutes late on five occasions and six minutes early on four others.

You can see the underlying Solr request in your browser’s console (for example, in Chrome on the Mac, choose View -> Developer -> JavaScript Console) and in the Solr logs. The SFO-TO-ATL request that I used (broken into three lines here solely for formatting purposes) is:

/solr/collection1/travel?&wt=json&facet=true&facet.limit=5&fq=Origin:SFO 
AND Dest:ATL&q=*:*&facet.pivot=UniqueCarrier,FlightNum,ArrDelay&
f.UniqueCarrier.facet.limit=10&f.FlightNum.facet.limit=10

The facet.pivot parameter provides the key functionality in this request. facet.pivot pivots from the airline (called UniqueCarrier) to FlightNum through to ArrDelay, thereby providing the nested structure that’s displayed in Figure 3’s Route Stats.

If you try a near query, as in near:JFK, your result should look similar to Figure 4:

Figure 4. Example screen showing airports near JFK

Screen capture from Solr Air showing JFK and nearby airports

The Solr request that underlies near queries takes advantage of Solr’s new spatial functionality, which I’ll detail later in this article. For now, you can likely discern some of the power of this new functionality by looking at the request itself (shortened here for formatting purposes):

...
&fq=source:Airports&q=AirportLocationJTS:"IsWithin(Circle(40.639751,-73.778925 d=3))"
...

As you might guess, the request looks for all airports that fall within a circle whose center is at 40.639751 degrees latitude and -73.778925 degrees longitude and that has a radius of 3 degrees, or roughly 333 kilometers (a degree of latitude spans about 111 kilometers).

By now you should have a strong sense that Lucene and Solr applications can slice and dice data — numerical, textual, or other — in interesting ways. And because Lucene and Solr are both open source, with a commercial-friendly license, you are free to add your own customizations. Better yet, the 4.x line of Lucene and Solr increases the number of places where you can insert your own ideas and functionality without needing to overhaul all of the code. Keep this capability in mind as you look next at some of the highlights of Lucene 4 (version 4.4 as of this writing) and then at the Solr 4 highlights.

Lucene 4: Foundations for next-generation search and analytics

A sea change

Lucene 4 is nearly a complete rewrite of the underpinnings of Lucene for better performance and flexibility. At the same time, this release represents a sea change in the way the community develops software, thanks to Lucene’s new randomized unit-testing framework and rigorous community standards that relate to performance. For instance, the randomized test framework (which is available as a packaged artifact for anyone to use) makes it easy for the project to test the interactions of variables such as JVMs, locales, input content and queries, storage formats, scoring formulas, and many more. (Even if you never use Lucene, you might find the test framework useful in your own projects.)

Some of the key additions and changes to Lucene are in the categories of speed and memory, flexibility, data structures, and faceting. (To see all of the details on the changes in Lucene, read the CHANGES.txt file that’s included within every Lucene distribution.)

Speed and memory

Although prior Lucene versions are generally considered to be fast enough — especially relative to comparable general-purpose search libraries — enhancements in Lucene 4 make many operations significantly faster than in previous versions.

The graph in Figure 5 captures the performance of Lucene indexing as measured in gigabytes per hour. (Credit Lucene committer Mike McCandless for the nightly Lucene benchmarking graphs; see Resources.) Figure 5 shows that a huge performance improvement occurred in the first half of May [[year?]]:

Figure 5. Lucene indexing performance

Graph of Lucene indexing performance that shows an increase from 100GB per hour to approximately 270GB per hour in the first half of May [[year?]]

The improvement that Figure 5 shows comes from a series of changes that were made to how Lucene builds its index structures and how it handles concurrency when building them (along with a few other changes, including JVM changes and use of solid-state drives). The changes focused on removing synchronizations while Lucene writes the index to disk; for details (which are beyond this article’s scope) see Resources for links to Mike McCandless’s blog posts.

In addition to improving overall indexing performance, Lucene 4 can perform near real time (NRT) indexing operations. NRT operations can significantly reduce the time that it takes for the search engine to reflect changes to the index. To use NRT operations, you must do some coordination in your application between Lucene’s IndexWriter and IndexReader. Listing 1 (a snippet from the download package’s src/main/java/IndexingExamples.java file) illustrates this interplay:

Listing 1. Example of NRT search in Lucene
...
doc = new HashSet<IndexableField>();
index(writer, doc);
//Get a searcher
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(directory));
printResults(searcher);
//Now, index one more doc
doc.add(new StringField("id", "id_" + 100, Field.Store.YES));
doc.add(new TextField("body", "This is document 100.", Field.Store.YES));
writer.addDocument(doc);
//The results are still 100
printResults(searcher);
//Don't commit; just open a new searcher directly from the writer
searcher = new IndexSearcher(DirectoryReader.open(writer, false));
//The results now reflect the new document that was added
printResults(searcher);
...

In Listing 1, I first index and commit a set of documents to the Directory and then search the Directory — the traditional approach in Lucene. NRT comes in when I proceed to index one more document: Instead of doing a full commit, Lucene creates a new IndexSearcher from the IndexWriter and then does the search. You can run this example by changing to the $SOLR_AIR directory and executing this sequence of commands:

  1. ant compile
  2. cd build/classes
  3. java -cp ../../lib/*:. IndexingExamples

Note: I grouped several of this article’s code examples into IndexingExamples.java, so you can use the same command sequence to run the later examples in Listing 2 and Listing 4.

The output that prints to the screen is:

...
Num docs: 100
Num docs: 100
Num docs: 101
...
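
If you adopt NRT in your own code, you don’t have to juggle readers by hand as Listing 1 does. Here is a minimal sketch (my own, not part of the article’s sample code) that uses Lucene’s SearcherManager utility, assuming an existing IndexWriter named writer and the printResults helper from Listing 1:

SearcherManager manager = new SearcherManager(writer, true, new SearcherFactory());
//... add or update documents with writer ...
manager.maybeRefresh();                  //make recent, uncommitted changes searchable
IndexSearcher nrtSearcher = manager.acquire();
try {
  printResults(nrtSearcher);             //reuses the helper from Listing 1
} finally {
  manager.release(nrtSearcher);          //always release searchers you acquire
}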

Lucene 4 also contains memory improvements that leverage some more-advanced data structures (which I cover in more detail in Finite state automata and other goodies). These improvements not only reduce Lucene’s memory footprint but also significantly speed up queries that are based on wildcards and regular expressions. Additionally, the code base moved away from working with Java String objects in favor of managing large allocations of byte arrays. (The BytesRef class is seemingly ubiquitous under the covers in Lucene now.) As a result, String overhead is reduced and the number of objects on the Java heap is under better control, which reduces the likelihood of stop-the-world garbage collections.

Some of the flexibility enhancements also yield performance and storage improvements because you can choose better data structures for the types of data that your application is using. For instance, as you’ll see next, you can choose to index/store unique keys (which are dense and don’t compress well) one way in Lucene and index/store text in a completely different way that better suits text’s sparseness.

Flexibility

What’s a segment? A Lucene segment is a subset of the overall index. In many ways a segment is a self-contained mini-index. Lucene builds its index by using segments to balance the availability of the index for searching with the speed of writing. Segments are write-once files during indexing, and a new one is created every time you commit during writing. In the background, by default, Lucene periodically merges smaller segments into larger segments to improve read performance and reduce system overhead. You can exercise complete control over this process.
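
For example, here is a minimal sketch (my own illustration, not part of the article’s sample code, and assuming the analyzer and directory variables from the listings) of tuning segment merging through the IndexWriterConfig; the numbers are illustrative rather than recommendations:

TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setMaxMergedSegmentMB(512.0);   //cap the size that merged segments can grow to
mergePolicy.setSegmentsPerTier(10.0);       //allow this many segments per tier before merging
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_44, analyzer);
conf.setMergePolicy(mergePolicy);
IndexWriter writer = new IndexWriter(directory, conf);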

The flexibility improvements in Lucene 4.x unlock a treasure-trove of opportunity for developers (and researchers) who want to squeeze every last bit of quality and performance out of Lucene. To enhance flexibility, Lucene offers two new well-defined plugin points. Both plugin points have already had a significant impact on the way Lucene is both developed and used.

The first new plugin point is designed to give you deep control over the encoding and decoding of a Lucene segment. The Codec class defines this capability. Codec gives you control over the format of the postings list (that is, the inverted index), Lucene storage, boost factors (also called norms), and much more.

In some applications you might want to implement your own Codec. But it’s much more likely that you’ll want to change the Codec that’s used for a subset of the document fields in the index. To understand this point, it helps to think about the kinds of data you are putting in your application. For instance, identifying fields (for example, your primary key) are usually unique. Because primary keys only ever occur in one document, you might want to encode them differently from how you encode the body of an article’s text. You don’t actually change the Codec in these cases. Instead, you change one of the lower-level classes that the Codec delegates to.
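
As a hedged sketch of that kind of delegation (again my own illustration, not from the sample code): subclass the default codec of the 4.4 timeframe, Lucene42Codec, and swap in a different postings format for just the unique-key field:

Codec perFieldCodec = new Lucene42Codec() {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    if ("id".equals(field)) {
      //keep the unique-key terms in a memory-resident postings format
      return PostingsFormat.forName("Memory");
    }
    return super.getPostingsFormatForField(field);
  }
};
conf.setCodec(perFieldCodec);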

To demonstrate, I’ll show you a code example that uses my favorite Codec, the SimpleTextCodec. The SimpleTextCodec is what it sounds like: a Codec for encoding the index in simple text. (The fact that SimpleTextCodec was written and passes Lucene’s extensive test framework is a testament to Lucene’s enhanced flexibility.) SimpleTextCodec is too large and slow to use in production, but it’s a great way to see what a Lucene index looks like under the covers, which is why it is my favorite. The code in Listing 2 changes a Codec instance to SimpleTextCodec:

Listing 2. Example of changing Codec instances in Lucene
...
conf.setCodec(new SimpleTextCodec());
File simpleText = new File("simpletext");
directory = new SimpleFSDirectory(simpleText);
//Let's write to disk so that we can see what it looks like
writer = new IndexWriter(directory, conf);
index(writer, doc);//index the same docs as before
...

By running the Listing 2 code, you create a local build/classes/simpletext directory. To see the Codec in action, change to build/classes/simpletext and open the .cfs file in a text editor. You can see that the .cfs file truly is plain old text, like the snippet in Listing 3:

Listing 3. Portion of _0.cfs plain-text index file
...
  term id_97
    doc 97
  term id_98
    doc 98
  term id_99
    doc 99
END
doc 0
  numfields 4
  field 0
    name id
    type string
    value id_100
  field 1
    name body
    type string
    value This is document 100.
...

For the most part, changing the Codec isn’t useful until you are working with extremely large indexes and query volumes, or if you are a researcher or search-engine maven who loves to play with bare metal. Before changing Codecs in those cases, do extensive testing of the various available Codecs by using your actual data. Solr users can set and change these capabilities by modifying simple configuration items. Refer to the Solr Reference Guide for more details (see Resources).

The second significant new plugin point makes Lucene’s scoring model completely pluggable. You are no longer limited to using Lucene’s default scoring model, which some detractors claim is too simple. If you prefer, you can use alternative scoring models such as BM25 and Divergence from Randomness (see Resources), or you can write your own. Why write your own? Perhaps your “documents” represent molecules or genes; you want a fast way of ranking them, but term frequency and document frequency aren’t applicable. Or perhaps you want to try out a new scoring model that you read about in a research paper to see how it works on your content. Whatever your reason, changing the scoring model requires you to change the model at indexing time through the IndexWriterConfig.setSimilarity(Similarity) method, and at search time through the IndexSearcher.setSimilarity(Similarity) method. Listing 4 demonstrates changing the Similarity by first running a query that uses the default Similarity and then re-indexing and rerunning the query using Lucene’s BM25Similarity:

Listing 4. Changing Similarity in Lucene
conf = new IndexWriterConfig(Version.LUCENE_44, analyzer);
directory = new RAMDirectory();
writer = new IndexWriter(directory, conf);
index(writer, DOC_BODIES);
writer.close();
searcher = new IndexSearcher(DirectoryReader.open(directory));
System.out.println("Lucene default scoring:");
TermQuery query = new TermQuery(new Term("body", "snow"));
printResults(searcher, query, 10);

BM25Similarity bm25Similarity = new BM25Similarity();
conf.setSimilarity(bm25Similarity);
Directory bm25Directory = new RAMDirectory();
writer = new IndexWriter(bm25Directory, conf);
index(writer, DOC_BODIES);
writer.close();
searcher = new IndexSearcher(DirectoryReader.open(bm25Directory));
searcher.setSimilarity(bm25Similarity);
System.out.println("Lucene BM25 scoring:");
printResults(searcher, query, 10);

Run the code in Listing 4 and examine the output. Notice that the scores are indeed different. Whether the results of the BM25 approach more accurately reflect a user’s desired set of results is ultimately up to you and your users to decide. I recommend that you set up your application in a way that makes it easy for you to run experiments. (A/B testing should help.) Then compare not only the Similarity results, but also the results of varying query construction, Analyzer, and many other items.
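
If you do decide to write your own model, the gentlest route is often to override a single piece of an existing Similarity. Here is a minimal sketch (my own, not part of the sample code) that ignores term frequency entirely by extending DefaultSimilarity, reusing the conf and searcher objects from Listing 4:

Similarity binaryTf = new DefaultSimilarity() {
  @Override
  public float tf(float freq) {
    return freq > 0 ? 1.0f : 0.0f; //any number of occurrences counts the same
  }
};
conf.setSimilarity(binaryTf);       //apply at indexing time
searcher.setSimilarity(binaryTf);   //and at search time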

Finite state automata and other goodies

A complete overhaul of Lucene’s data structures and algorithms spawned two especially interesting advancements in Lucene 4:

  • DocValues (also known as column stride fields).
  • Finite State Automata (FSA) and Finite State Transducers (FST). I’ll refer to both as FSAs for the remainder of this article. (Technically, an FST outputs values as its nodes are visited, but that distinction isn’t important for the purposes of this article.)

Both DocValues and FSA provide significant new performance benefits for certain types of operations that can affect your application.

On the DocValues side, in many cases applications need to access all of the values of a single field very quickly, in sequence. Or applications need to do quick lookups of values for sorting or faceting, without incurring the cost of building an in-memory version from an index (a process that’s known as un-inverting). DocValues are designed to answer these kinds of needs.
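
As a minimal sketch of what that looks like in code (my own illustration; the field names are hypothetical), you index a numeric field normally for querying and add a parallel DocValues field for cheap sorting:

Document doc = new Document();
doc.add(new LongField("arrDelay", 15L, Field.Store.YES));   //indexed and stored for normal queries
doc.add(new NumericDocValuesField("arrDelayDV", 15L));      //column-stride copy for sorting and faceting
writer.addDocument(doc);
writer.commit();
//Sort on the DocValues field at search time without un-inverting the index
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(directory));
Sort byDelay = new Sort(new SortField("arrDelayDV", SortField.Type.LONG));
TopDocs hits = searcher.search(new MatchAllDocsQuery(), 10, byDelay);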

An application that does a lot of wildcard or fuzzy queries should see a significant performance improvement due to the use of FSAs. Lucene and Solr now support query auto-suggest and spell-checking capabilities that leverage FSAs. And Lucene’s default Codec significantly reduces disk and memory footprint by using FSAs under the hood to store the term dictionary (the structure that Lucene uses to look up query terms during a search). FSAs have many uses in language processing, so you might also find Lucene’s FSA capabilities to be instructive for other applications.

Figure 6 shows an FSA that’s built from http://examples.mikemccandless.com/fst.py using the words mop, pop, moth, star, stop, and top, along with associated weights. From the example, you can imagine starting with input such as moth, breaking it down into its characters (m-o-t-h), and then following the arcs in the FSA.

Figure 6. Example of an FSA

Illustration of an FSA from http://examples.mikemccandless.com/fst.py

Listing 5 (excerpted from the FSAExamples.java file in this article’s sample code download) shows a simple example of building your own FSA by using Lucene’s API:

Listing 5. Example of a simple Lucene automaton
String[] words = {"hockey", "hawk", "puck", "text", "textual", "anachronism", "anarchy"};
Collection<BytesRef> strings = new ArrayList<BytesRef>();
for (String word : words) {
  strings.add(new BytesRef(word));
}
//build up a simple automaton out of several words
Automaton automaton = BasicAutomata.makeStringUnion(strings);
CharacterRunAutomaton run = new CharacterRunAutomaton(automaton);
System.out.println("Match: " + run.run("hockey"));
System.out.println("Match: " + run.run("ha"));

In Listing 5, I build an Automaton out of various words and feed it into a RunAutomaton. As the name implies, a RunAutomaton runs input through the automaton, in this case to match the input strings that are captured in the print statements at the end of Listing 5. Although this example is trivial, it lays the groundwork for understanding much more advanced capabilities that I’ll leave to readers to explore (along with DocValues) in the Lucene APIs. (See Resources for relevant links.)

Faceting

At its core, faceting generates a count of document attributes to give users an easy way to narrow down their search results without making them guess which keywords to add to the query. For example, if someone searches a shopping site for televisions, facets tell them how many TV models each manufacturer makes. Increasingly, faceting is also used to power search-based business analytics and reporting tools. By using more-advanced faceting capabilities, you give users the ability to slice and dice facets in interesting ways.

Facets have long been a hallmark of Solr (since version 1.1). Now Lucene has its own faceting module that stand-alone Lucene applications can leverage. Lucene’s faceting module isn’t as rich in functionality as Solr’s, but it does offer some interesting tradeoffs. Lucene’s faceting module isn’t dynamic, in that you must make some faceting decisions at indexing time. But it is hierarchical, and it doesn’t have the cost of dynamically un-inverting fields into memory.

Listing 6 (part of the sample code’s FacetExamples.java file) showcases some of Lucene’s new faceting capabilities:

Listing 6. Lucene faceting examples
...
DirectoryTaxonomyWriter taxoWriter = 
     new DirectoryTaxonomyWriter(facetDir, IndexWriterConfig.OpenMode.CREATE);
FacetFields facetFields = new FacetFields(taxoWriter);
for (int i = 0; i < DOC_BODIES.length; i++) {
  String docBody = DOC_BODIES[i];
  String category = CATEGORIES[i];
  Document doc = new Document();
  CategoryPath path = new CategoryPath(category, '/');
  //Setup the fields
  facetFields.addFields(doc, Collections.singleton(path));//just do a single category path
  doc.add(new StringField("id", "id_" + i, Field.Store.YES));
  doc.add(new TextField("body", docBody, Field.Store.YES));
  writer.addDocument(doc);
}
writer.commit();
taxoWriter.commit();
DirectoryReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
DirectoryTaxonomyReader taxor = new DirectoryTaxonomyReader(taxoWriter);
ArrayList<FacetRequest> facetRequests = new ArrayList<FacetRequest>();
CountFacetRequest home = new CountFacetRequest(new CategoryPath("Home", '/'), 100);
home.setDepth(5);
facetRequests.add(home);
facetRequests.add(new CountFacetRequest(new CategoryPath("Home/Sports", '/'), 10));
facetRequests.add(new CountFacetRequest(new CategoryPath("Home/Weather", '/'), 10));
FacetSearchParams fsp = new FacetSearchParams(facetRequests);

FacetsCollector facetsCollector = FacetsCollector.create(fsp, reader, taxor);
searcher.search(new MatchAllDocsQuery(), facetsCollector);

for (FacetResult fres : facetsCollector.getFacetResults()) {
  FacetResultNode root = fres.getFacetResultNode();
  printFacet(root, 0);
}

The key pieces in Listing 6 that go beyond normal Lucene indexing and search are in the use of the FacetFields, FacetsCollector, TaxonomyReader, and TaxonomyWriter classes. FacetFields creates the appropriate field entries in the document and works in concert with TaxonomyWriter at indexing time. At search time, TaxonomyReader works with FacetsCollector to get the appropriate counts for each category. Note, also, that Lucene’s faceting module creates a secondary index that, to be effective, must be kept in sync with the main index. Run the Listing 6 code by using the same command sequence you used for the earlier examples, except substitute FacetExamples for IndexingExamples in the java command. You should get:

Home (0.0)
 Home/Children (3.0)
  Home/Children/Nursery Rhymes (3.0)
 Home/Weather (2.0)
 Home/Sports (2.0)
  Home/Sports/Rock Climbing (1.0)
  Home/Sports/Hockey (1.0)
 Home/Writing (1.0)
 Home/Quotes (1.0)
  Home/Quotes/Yoda (1.0)
 Home/Music (1.0)
  Home/Music/Lyrics (1.0)
...

Notice that in this particular implementation I’m not including the counts for the Home facet, because including them can be expensive. That option is supported by setting up the appropriate FacetIndexingParams, which I’m not covering here. Lucene’s faceting module has additional capabilities that I’m not covering. I encourage you to explore them — and other new Lucene features that this article doesn’t touch on — by checking out the article Resources. And now, on to Solr 4.x.

Solr 4: Search and analytics at scale

From an API perspective, much of Solr 4.x looks and feels the same as previous versions. But 4.x contains numerous enhancements that make it easier to use, and more scalable, than ever. Solr also enables you to answer new types of questions, all while leveraging many of the Lucene enhancements that I just outlined. Other changes are geared toward the developer’s getting-started experience. For example, the all-new Solr Reference Guide (see Resources) provides book-quality documentation of every Solr release (starting with 4.4). And Solr’s new schemaless capabilities make it easy to add new data to the index quickly without first needing to define a schema. You’ll learn about Solr’s schemaless feature in a moment. First you’ll look at some of the new search, faceting, and relevance enhancements in Solr, some of which you saw in action in the Solr Air application.

Search, faceting, and relevance

Several new Solr 4 capabilities are designed to make it easier — on both the indexing side and the search-and-faceting side — to build next-generation data-driven applications. Table 1 summarizes the highlights and includes command and code examples when applicable:

Table 1. Indexing, searching, and faceting highlights in Solr 4
Pivot faceting
Gather counts for all of a facet’s subfacets, as filtered through the parent facet. See the Solr Air example for more details.
Example (pivot on a variety of fields):
http://localhost:8983/solr/collection1/travel?&wt=json&facet=true&facet.limit=5&fq=&q=*:*&facet.pivot=Origin,Dest,UniqueCarrier,FlightNum,ArrDelay&indent=true

New relevance function queries
Access various index-level statistics, such as document frequency and term frequency, as part of a function query.
Example (add the document frequency for the term Origin:SFO to all returned documents; note that this request also uses the new DocTransformers capability):
http://localhost:8983/solr/collection1/travel?&wt=json&q=*:*&fl=*,{!func}docfreq('Origin',%20'SFO')&indent=true

Joins
Represent more-complex document relationships and then join them at search time. More-complex joins are slated for future releases of Solr.
Example (return only flights whose originating airport codes appear in the Airport data set, and compare the results to those of a request without the join):
http://localhost:8983/solr/collection1/travel?&wt=json&indent=true&q={!join%20from=IATA%20to=Origin}*:*

Codec support
Change the Codec for the index and the postings format for individual fields.
Example (use the SimpleTextCodec for a field):
<fieldType name="string_simpletext" postingsFormat="SimpleText" />

New update processors
Use Solr’s Update Processor framework to plug in code that changes documents after they are sent to Solr but before they are indexed. Examples include:
  • Field mutating (for example, concatenating fields, parsing numerics, and trimming)
  • Scripting: use JavaScript, or other code that’s supported by the JavaScript engine, to process documents. See the update-script.js file in the Solr Air example.
  • Language detection (technically available in 3.5, but worth mentioning here) for identifying the language (such as English or Japanese) that’s used in a document.

Atomic updates
Send in just the parts of a document that have changed, and let Solr take care of the rest.
Example (from the command line, using cURL, change the origin of document 243551 to FOO):
curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '[{"id": "243551","Origin": {"set":"FOO"}}]'

You can run the first three example commands in Table 1 in your browser’s address field (not the Solr Air UI) against the Solr Air demo data.

For more details on relevance functions, joins, and Codec support — and other new Solr 4 features — see Resources for relevant links to the Solr Wiki and elsewhere.

Scaling, NoSQL, and NRT

Probably the single most significant change to Solr in recent years was that building a multinode scalable search solution became much simpler. With Solr 4.x, it’s easier than ever to scale Solr to be the authoritative storage and access mechanism for billions of records — all while enjoying the search and faceting capabilities that Solr has always been known for. Furthermore, you can rebalance your cluster as your capacity needs change, as well as take advantage of optimistic locking, atomic updates of content, and real-time retrieval of data even if it hasn’t been indexed yet. The new distributed capabilities in Solr are referred to collectively as SolrCloud.

How does SolrCloud work? Documents that are sent to Solr 4 when it’s running in (optional) distributed mode are routed according to a hashing mechanism to a node in the cluster (called the leader). The leader is responsible for indexing the document into a shard. A shard is a single index that is served by a leader and zero or more replicas. As an illustration, assume that you have four machines and two shards. When Solr starts, each of the four machines communicates with the other three. Two of the machines are elected leaders, one for each shard. The other two nodes automatically become replicas of one of the shards. If one of the leaders fails for some reason, a replica (in this case the only replica) becomes the leader, thereby guaranteeing that the system still functions properly. You can infer from this example that in a production system enough nodes must participate to ensure that you can handle system outages.

To see SolrCloud in action, you can launch a two-node, two-shard system by running the start-solr.sh script that you used in the Solr Air example with the -c and -z flags. From the *NIX command line, first shut down your old instance:

kill -9 PROCESS_ID

Then restart the system:

bin/start-solr.sh -c -z

The -c flag erases the old index. The -z flag tells Solr to start up with an embedded version of Apache Zookeeper.

Apache Zookeeper

Zookeeper is a distributed coordination system that’s designed to elect leaders, establish a quorum, and perform other tasks to coordinate the nodes in a cluster. Thanks to Zookeeper, a Solr cluster never suffers from “split-brain” syndrome, whereby part of the cluster behaves independently of the rest of the cluster as the result of a partitioning event. See Resources to learn more about Zookeeper.

Point your browser at the SolrCloud admin page, http://localhost:8983/solr/#/~cloud, to verify that two nodes are participating in the cluster. You can now re-index your content, and it will be spread across both nodes. All queries to the system are also automatically distributed. You should get the same number of hits for a match-all-documents search against two nodes that you got for one node.
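
For example, a standard match-all request such as the following (plain Solr syntax, not specific to the sample application) should report the same numFound from either node:

http://localhost:8983/solr/collection1/select?q=*:*&wt=json&rows=0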

The start-solr.sh script launches Solr with the following command for the first node:

java -Dbootstrap_confdir=$SOLR_HOME/solr/collection1/conf 
-Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

The script tells the second node where Zookeeper is:

java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

Embedded Zookeeper is great for getting started, but to ensure high availability and fault tolerance for production systems, set up a stand-alone set of Zookeeper instances in your cluster.

Stacked on top of the SolrCloud capabilities are NRT support and many NoSQL-like functions, such as:

  • Optimistic locking
  • Atomic updates
  • Real-time gets (retrieving a specific document before it is committed)
  • Transaction-log-backed durability

Many of the distributed and NoSQL functions in Solr — such as automatic versioning of documents and transaction logs — work out of the box. For a few other features, the descriptions and examples in Table 2 will be helpful:

Table 2. Summary of distributed and NoSQL features in Solr 4
Realtime get
Retrieve a document, by ID, regardless of its state of indexing or distribution.
Example (get the document whose ID is 243551):
http://localhost:8983/solr/collection1/get?id=243551

Shard splitting
Split your index into smaller shards so that they can be migrated to new nodes in the cluster.
Example (split shard1 into two shards):
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1

NRT
Use NRT to search for new content much more quickly than in previous versions.
Example (turn on <autoSoftCommit> in your solrconfig.xml file):
<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>

Document routing
Specify which documents live on which nodes, for example to ensure that all of a user’s data is on certain machines. Read Joel Bernstein’s blog post (see Resources).

Collections
Create, delete, or update collections as needed, programmatically, using Solr’s new collections API.
Example (create a new collection named hockey):
http://localhost:8983/solr/admin/collections?action=CREATE&name=hockey&numShards=2
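
One feature that Table 2 doesn’t show is optimistic locking, which builds on the automatic _version_ field. As a hedged sketch (the version value below is a placeholder, not a real one), you send the last version you saw along with an atomic update; if the document has changed in the meantime, Solr rejects the update with an HTTP 409 conflict:

curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '[{"id": "243551", "Origin": {"set":"FOO"}, "_version_": 1234567890123456789}]'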

Going schemaless

Schemaless: Marketing hype?

Data collections rarely lack a schema. Schemaless is a marketing term that’s derived from a data-ingestion engine’s ability to react appropriately to the data “telling” the engine what the schema is — instead of the engine specifying the form that the data must take. For instance, Solr can accept JSON input and can index content appropriately on the basis of the schema that’s implicitly defined in the JSON. As someone pointed out to me on Twitter, less schema is a better term than schemaless, because you define the schema in one place (such as a JSON document) instead of two (such as a JSON document and Solr).

Based on my experience, in the vast majority of cases you should not use schemaless in a production system unless you enjoy debugging errors at 2 a.m. when your system thinks it has one type of data and in reality has another.

Solr’s schemaless functionality enables clients to add content rapidly without the overhead of first defining a schema.xml file. Solr examines the incoming data and passes it through a cascading set of value parsers. The value parsers guess the data’s type and then automatically add the fields to the internal schema and add the content to the index.

A typical production system (with some exceptions) shouldn’t use schemaless, because the value guessing isn’t always perfect. For instance, the first time Solr sees a new field, it might identify the field as an integer and thus define an integer FieldType in the underlying schema. But you might discover three weeks later that the field is useless for searching because the rest of the content that Solr sees for that field consists of floating-point values.

However, schemaless is especially helpful for early-stage development or for indexing content whose format you have little to no control over. For instance, Table 2 includes an example of using the collections API in Solr to create a new collection:

http://localhost:8983/solr/admin/collections?action=CREATE&name=hockey&numShards=2

After you create the collection, you can use schemaless to add content to it. First, though, take a look at the current schema. As part of implementing schemaless support, Solr also added Representational State Transfer (REST) APIs for accessing the schema. You can see all of the fields defined for the hockey collection by pointing your browser (or cURL on the command line) at http://localhost:8983/solr/hockey/schema/fields. You see all of the fields from the Solr Air example. The schema uses those fields because the create option used my default configuration as the basis for the new collection. You can override that configuration if you want. (A side note: The setup.sh script that’s included in the sample code download uses the new schema APIs to create all of the field definitions automatically.)
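
You can also ask for a single field definition. For example, a read-only request along these lines (the field name is just one from the Solr Air schema) returns only that field:

http://localhost:8983/solr/hockey/schema/fields/Origin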

To add to the collection by using schemaless, run:

bin/schemaless-example.sh

The following JSON is added to the hockey collection that you created earlier:

[
    {
        "id": "id1",
        "team": "Carolina Hurricanes",
        "description": "The NHL franchise located in Raleigh, NC",
        "cupWins": 1
    }
]

As you know from examining the schema before you added this JSON to the collection, the team, description, and cupWins fields are new. When the script ran, Solr guessed their types automatically and created the fields in the schema. To verify, refresh the results at http://localhost:8983/solr/hockey/schema/fields. You should now see team, description, and cupWins all defined in the list of fields.

Spatial (not just geospatial) improvements

Solr’s longstanding support for point-based spatial searching enables you to find all documents that are within some distance of a point. Although Solr supports this approach in an n-dimensional space, most people use it for geospatial search (for example, find all restaurants near my location). But until now, Solr didn’t support more-involved spatial capabilities such as indexing polygons or performing searches within indexed polygons. Some of the highlights of the new spatial package are:

  • Support through the Spatial4J library (see Resources) for many new spatial types — such as rectangles, circles, lines, and arbitrary polygons — and support for the Well Known Text (WKT) format
  • Multivalued indexed fields, which you can use to encode multiple points into the same field
  • Configurable precision that gives the developer more control over accuracy versus computation speed
  • Fast filtering of content
  • Query support for IsWithin, Contains, and IsDisjointTo
  • Optional support for the Java Topology Suite (JTS) (see Resources)
  • Lucene APIs and artifacts

The schema for the Solr Air application has several field types that are set up to take advantage of this new spatial functionality. I defined two field types for working with the latitude and longitude of the airport data:

<fieldType name="location_jts" 
distErrPct="0.025" spatialContextFactory=
"com.spatial4j.core.context.jts.JtsSpatialContextFactory" 
maxDistErr="0.000009" units="degrees"/>

<fieldType name="location_rpt" 
distErrPct="0.025" geo="true" maxDistErr="0.000009" units="degrees"/>

The location_jts field type explicitly uses the optional JTS integration to define a point, and the location_rpt field type doesn’t. If you want to index anything more complex than simple rectangles, you need to use the JTS version. The fields’ attributes help to define the system’s accuracy. These attributes are required at indexing time because Solr, via Lucene and Spatial4j, encodes the data in multiple ways to ensure that the data can be used efficiently at search time. For your applications, you’ll likely want to run some tests with your data to determine the tradeoffs to make in terms of index size, precision, and query-time performance.

In addition, the near query that’s used in the Solr Air application uses the new spatial-query syntax (IsWithin on a Circle) for finding airports near the specified origin and destination airports.
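
Because the location_jts type is backed by JTS, the same syntax also accepts arbitrary polygons expressed in Well Known Text. As a sketch (the coordinates below are made up for illustration and are not part of the sample application), a polygon query looks like this:

&fq=source:Airports&q=AirportLocationJTS:"IsWithin(POLYGON((-122.6 37.5, -122.6 37.9, -122.2 37.9, -122.2 37.5, -122.6 37.5)))"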

New administration UI

In wrapping up this section on Solr, I would be remiss if I didn’t showcase the much more user-friendly and modern Solr admin UI. The new UI not only cleans up the look and feel but also adds new functionality for SolrCloud, document additions, and much more.

For starters, when you first point your browser at http://localhost:8983/solr/#/, you should see a dashboard that succinctly captures much of the current state of Solr: memory usage, working directories, and more, as in Figure 7:

Figure 7. Example Solr dashboard

Screen capture of an example Solr dashboard

If you select Cloud in the left side of the dashboard, the UI displays details about SolrCloud. For example, you get in-depth information about the state of configuration, live nodes, and leaders, as well as visualizations of the cluster topology. Figure 8 shows an example. Take a moment to work your way through all of the cloud UI options. (You must be running in SolrCloud mode to see them.)

Figure 8. Example SolrCloud UI

Screen capture of a SolrCloud UI example

The last area of the UI to cover that’s not tied to a specific core/collection/index is the Core Admin set of screens. These screens provide point-and-click control over the administration of cores, including adding, deleting, reloading, and swapping cores. Figure 9 shows the Core Admin UI:

Figure 9. Example of Core Admin UI

Screen capture of the core Solr admin UI

By selecting a core from the Core list, you access an overview of information and statistics that are specific to that core. Figure 10 shows an example:

Figure 10. Example core overview

Screen capture of a core overview example in the Solr UI

Most of the per-core functionality is similar to the pre-4.x UI’s functionality (albeit presented in a much more pleasant way), with the exception of the Documents option. You can use the Documents option to add documents in various formats (JSON, CSV, XML, and others) to the collection directly from the UI, as Figure 11 shows:

Figure 11. Example of adding a document from the UI

Screen capture from the Solr UI that shows a JSON document being added to a collection

You can even upload rich document types such as PDF and Word. Take a moment to add some documents into your index or browse the other per-collection capabilities, such as the Query interface or the revamped Analysis screen.

The road ahead

Next-generation search-engine technology gives users the power to decide what to do with their data. This article gave you a good taste of what Lucene and Solr 4 are capable of, and, I hope, a broader sense of how search engines solve non-text-based search problems that involve analytics and recommendations.

Lucene and Solr are in constant motion, thanks to a large sustaining community that’s backed by more than 30 committers and hundreds of contributors. The community is actively developing two main branches: the current officially released 4.x branch and the trunk branch, which represents the next major (5.x) release. On the official release branch, the community is committed to backward compatibility and an incremental approach to development that focuses on easy upgrades of current applications. On the trunk branch, the community is a bit less restricted in terms of ensuring compatibility with previous releases. If you want to try out the cutting edge in Lucene or Solr, check out the trunk branch of the code from Subversion or Git (see Resources). Whichever path you choose, you can take advantage of Lucene and Solr for powerful search-based analytics that go well beyond plain text search.

Acknowledgments

Thanks to David Smiley, Erik Hatcher, Yonik Seeley, and Mike McCandless for their help.

(source IBM.com)

 

How to Set Up The Ampache Streaming Music Server In Ubuntu

Prerequisites

To follow this tutorial you will need:

  1. A PC with Ubuntu 12.04 LTS running LAMP
  2. Your own web address (optional – required for streaming music to external clients like your work computer or cell phone)
  3. Port 80 forwarded from your router to your Ubuntu server (optional – required for #2 above)
  4. SAMBA/SWAT running on your server

Each of these can be done for free by following the four tutorials linked below.

  1. http://www.ubuntugeek.com/step-by-step-ubuntu-12-04-precise-lamp-server-setup.html
  2. http://www.howtoforge.com/how-to-install-no-ip2-on-ubuntu-12.04-lts-in-order-to-host-servers-on-a-dynamic-ip-address
  3. http://www.howtoforge.com/How-to-forward-ports-to-your-ubuntu-12.04-LTS-LAMP-server
  4. http://askubuntu.com/questions/196272/how-to-install-and-configure-swat-in-ubuntu-server-12-04

NB: The very last step of (4) is to log in as a “member of the admin group” – you can skip that for now and log in to SWAT later, when this tutorial instructs you to use it.

OK, I will assume you followed these four guides and now have an Ubuntu 12.04 LAMP system running SAMBA/SWAT and a No-IP web address with port 80 forwarded to the Ubuntu server.

Create Media directories

Create your media directory and a downloads directory, and give them quite permissive permissions so anyone on your network can access the folders. Please note that wherever you see <ubuntu username> in this tutorial, you need to exchange it for your own Ubuntu username.

sudo mkdir ~/music
sudo chmod 777 ~/music
sudo mkdir ~/downloads
sudo chmod 777 ~/downloads

The 777 mode means that anyone with access to the system can read, edit, and run the files contained within. You could set more restrictive permissions here, but I prefer easy access to my media folders.

Set up Windows folder sharing using SWAT

In a web browser, log in to SWAT as the admin: go to http://<Ubuntu server hostname>:901 (for example, http://ampache:901) and log in with the user name “root” and your <root user password>.

Click on the shares box at the top. Now click “create share”.

Enter the following into their respective boxes. Please remember to exchange <ubuntu username> for your own Ubuntu username.

path:        /home/<ubuntu username>/music
valid users:    <ubuntu username>
read only:    no
available:    yes

Now click commit changes. Now click advanced and set all the “masks” to 0777.

Click “commit changes” again.

Now click basic and “create share”. Repeat the process for your “downloads” folder.

You should now see your ubuntu server on the network and have access to the two shared folders you created provided you remember the <SAMBA user password> you set. Again please note this is quite permissive and gives everyone on the network easy access to the music and downloads folder provided they know the <SAMBA user password> when they try to access the folders.

Now you are ready to download and install Ampache. At this point I like to start copying my music over to the shared music folder, as this will take some time. Hopefully it will be done by the time I’m finished installing Ampache.

Install Ampache

Download and unpack ampache

Go to your PuTTY terminal and enter:

cd ~/downloads

Go to https://github.com/ampache/ampache/tags, right-click the latest tar.gz link, copy the link to the clipboard, and paste it into the terminal as in the following line (type “sudo wget”, paste the link, then add the “-O ampache.tar.gz” part):

sudo wget https://github.com/ampache/ampache/archive/3.6-alpha6.tar.gz -O ampache.tar.gz

Untar the tarball into an appropriate folder:

sudo mkdir /usr/local/src/www
sudo chmod 7777 /usr/local/src/www
sudo tar zxvf ampache.tar.gz -C /usr/local/src/www

Note the name of the extracted root folder, e.g. ampache-3.6-alpha6, and use that name wherever you see it in the following text.

Relax the permissions on the extracted folder:

 sudo chmod -R 7777 /usr/local/src/www/ampache-3.6-alpha6

For security’s sake, once installation is complete, we will give ownership of the extracted folder to the web server and tighten the permissions up again.

Enable php5-gd in the Apache web server to allow resizing of album art:

sudo apt-get install php5-gd
sudo /etc/init.d/apache2 restart

Create a link from the web-server root folder to the extracted ampache site:

cd /var/www/
sudo ln -s /usr/local/src/www/ampache-3.6-alpha6 ampache

Doing it this way allows us to move websites around and rename them with ease.

Online initial ampache configuration

Go back to your web browser and go to: http://<Ubuntu server hostname>/ampache

If all went well so far, you should see the start of the installation process. Note that since you used the Ubuntu 12.04 LAMP server, all the little OKs should be nice and green, indicating that the system is ready to have Ampache installed.

Click to start the configuration and fill in the boxes as follows

  • Desired Database Name        – ampache
  • MySQL Hostname             – localhost
  • something else I forget        – <leave blank>
  • MySQL Administrative Username     – root
  • MySQL Administrative Password     – <mySQL root password>
  • create database user for new database [check]
  • Ampache database username     – <ampache database user name> eg ampache
  • Ampache database User Password     – <ampache database password> eg ampachexyz123blahblah6htYd4
  • overwrite existing [unchecked]
  • use existing database [unchecked]

Here I recommend that you do NOT use your <ubuntu username> as the <ampache database user name> or your <ubuntu user password> as the <ampache database password>. Use a new username and password here: they will be stored in clear text inside your config file, and anyone reading them could gain control over the whole system if you had used your Ubuntu username and password.
You only need to remember the database username and password for the next step. They are for Ampache to use, not you.

Click “insert database” and then fill in the boxes for the next section as follows:

  • Web Path         – /ampache
  • Database Name         – ampache
  • MySQL Hostname         – localhost
  • MySQL port (optional)     – <leave empty>
  • MySQL Username         – <ampache database user name>     or root
  • MySQL Password         – <ampache database password> or mysql root password

Click “write”.

Note that the red words turn into green OK’s.

Click to continue to step 3.

Create the admin user passwords
Ampache admin username    – <Ampache admin username>
Ampache admin password    – <Ampache admin password>
Repeat the password

Here I usually use a password I can easily remember as I will need to use this to access the ampache site whenever I want to use it.

Click to update the database.

Click return.

The main ampache login screen should open and you can log in with the ampache admin username and password you just set.
DON’T forget this password, as it’s quite a pain to reset.

Now you should see the main ampache interface. Ampache is up and running but needs some more specific configuration before you can use it.

Ampache user specific configuration

Back to the putty terminal.

First we’ll create a space to drop the error logs so you can locate any issues later:

sudo mkdir /var/log/ampache
sudo chmod 7777 /var/log/ampache

Now we’ll create a temp directory to store zip files for users to download songs, albums, and playlists.

sudo mkdir /ziptemp
sudo chmod 7777 /ziptemp

Now we will start editing the ampache config file:

sudo nano /usr/local/src/www/ampache-3.6-alpha6/config/ampache.cfg.php

Scroll slowly down through the file and edit each set of the appropriate lines to read as follows. I have titled the sets to help split up their use.
To speed things up, you can use Ctrl-W to search for specific config parameters.

Network access – allow access from offsite:

require_localnet_session = "false"
access_control  = "true"

Allow zip downloads of songs/albums/playlists etc:

allow_zip_download = "true"
file_zip_download = "true"
file_zip_path = "/ziptemp"
memory_limit = 128

Aesthetic improvement:

resize_images = "true"

Allow comprehensive debugging:

debug = "true"
debug_level = 5
log_path = "/var/log/ampache"

Transcoding settings

Arguably, transcoding is simultaneously the most powerful and most frustrating feature of an Ampache install, but please bear with me because you’ll be thankful you got it to work. I’m going to explain a little about transcoding, then I’m going to give MY version of the configuration file lines as they are in my file, and the commands you need to run to install the additional transcoding software that’s required.

The way transcoding works is that ampache takes the original file, which could be in any format, and feeds it to a transcoding program such as avconv, which needs to be installed separately, along with a selection of program arguments depending on how the transcoding needs to be done and into what format the final file needs to be. The transcoding program then attempts to convert the file from one format to the next using the appropriate codec which also needs to be separately installed.

Since the Ampache team has no control over the programs and codecs that carry this out, there is only minimal help available for the correct syntax of the transcoding commands. I think the idea is that you should read the manual that comes with each program… Anyway, if any of your music is stored in anything other than mp3, you will need to get transcoding to work. Firstly, the transcode_cmd setting tells Ampache the base command used to start the transcode process. Next, the encode args are what Ampache adds to the main transcode command in order to modify the output to suit the requested output format. With the new HTML5 player, different browsers will request output in different formats – for example, Firefox will request ogg and Chrome will request mp3 – hence your encode args lines must be correct for both requested formats if you expect the player to work on different systems. At the time of writing, the latest version of Internet Explorer was not capable of using HTML5 players; the bug is at their end, but hopefully they will fix it soon. Until then you will need to use Firefox or Chrome to access the HTML5 player.

Here are the config lines for my transcode settings. You will need to find these lines in the config file as before and edit them to look as follows:

max_bit_rate = 576
min_bit_rate = 48
transcode_flac = required
transcode_mp3 = allowed
encode_target = mp3
transcode_cmd = "avconv -i %FILE%"
encode_args_ogg = "-vn -b:a max\(%SAMPLE%K\,49K\) -acodec libvorbis -vcodec libtheora -f ogg pipe:1"
encode_args_mp3 = "-vn -b:a %SAMPLE%K -acodec libmp3lame -f mp3 pipe:1"
encode_args_ogv = "-vcodec libtheora -acodec libvorbis -ar 44100 -f ogv pipe:1"
encode_args_mp4 = "-profile:0 baseline -frag_duration 2 -ar 44100 -f mp4 pipe:1"

I may come back and edit these later to improve video transcoding but for now they’ll do for music at least.

You may note the strange encode_args_ogg line. This is because, at the time of writing, the libvorbis library can’t handle output bit rates of less than 42K or so. For this reason I’ve added a bit to the line to force a bit rate of at least 49K.

OK, save and exit the config file, and we will install the missing programs and codec libraries.

If at any time you think you’ve ruined any particular config parameter you can open and check the distribution config file.
Do this with the command

sudo nano /usr/local/src/www/ampache-3.6-alpha6/config/ampache.cfg.php.dist

Transcoding software

Add the repository of multimedia libraries and programs:

sudo nano  /etc/apt/sources.list

Add the following two lines to your sources file:

deb http://download.videolan.org/pub/debian/stable/ /
deb-src http://download.videolan.org/pub/debian/stable/ /

Save and quit (ctrl -x then hit y).

Add the repository:

wget -O - http://download.videolan.org/pub/debian/videolan-apt.asc | sudo apt-key add -

Install the compiler and package programs:

sudo apt-get -y install build-essential checkinstall cvs subversion git git-core mercurial pkg-config apt-file

Install the audio libraries required to build the avconv audio transcoding:

sudo apt-get -y install lame libvorbis-dev vorbis-tools flac libmp3lame-dev libavcodec-extra*

Install the video libraries required for avconv video transcoding:

sudo apt-get install libfaac-dev libtheora-dev libvpx-dev

At this stage support for video cataloging and streaming in ampache is limited so I have focussed mostly on audio transcoding options with the view to incorporate video transcoding when support within ampache is better. I’ve included some video components as I get them to work.

Now we’re going to try and install the actual transcoding binary “avconv” which is preferred over ffmpeg in ubuntu.
This part will seem pretty deep and complicated but the processes are typical for installing programs from source files in ubuntu.
First you create a directory to store the source files, then you download the source files from the internet, then you configure the compiler with some settings, then you make the program and its libraries into a package, then you install the package.

Note that the “make” process is very time-consuming (and scary looking), particularly for avconv. Just grab a cup of coffee and watch the fun unfold until the command line appears waiting for input from you again.
Note also that some online guides suggest “make install” rather than “checkinstall”. In Ubuntu I prefer checkinstall, as it generates a simple uninstall process and optionally allows you to easily specify the package dependencies if you so choose, enabling you to share the packages with others.

(required) Get the architecture-optimised assembler yasm

sudo mkdir /usr/local/src/yasm
cd ~/downloads
wget http://www.tortall.net/projects/yasm/releases/yasm-1.2.0.tar.gz  

You could also go to http://www.tortall.net/projects/yasm/releases/ and check that there isn’t a version newer than 1.2.0; if there is, replace the link above with its link.

Unzip, configure, build, and install yasm:

tar xzvf yasm-1.2.0.tar.gz
sudo mv yasm-1.2.0 /usr/local/src/yasm/
cd /usr/local/src/yasm/yasm-1.2.0
sudo ./configure
sudo make
sudo checkinstall

Hit enter to select y to create the default docs.
Enter yasm when requested as the name for the installed program package.
Hit enter again and then again to start the process.

(optional) Get the x264 codec – required to enable on-the-fly video transcoding

sudo mkdir /usr/local/src/x264
cd /usr/local/src/
sudo git clone git://git.videolan.org/x264.git x264
cd /usr/local/src/x264
sudo ./configure --enable-shared --libdir=/usr/local/lib/
sudo make
sudo checkinstall

Hit enter to select y and create the default docs.
Enter x264 when requested as the name for the install.
Hit enter to get to the values edit part.
Select 3 to change the version.
Call it version 1.
Then hit enter again and again to start the process.

(Required) Get the avconv transcoding program

cd /usr/local/src/
sudo git clone git://git.libav.org/libav.git avconv
cd /usr/local/src/avconv
sudo LD_LIBRARY_PATH=/usr/local/lib/ ./configure --enable-gpl --enable-libx264 --enable-nonfree --enable-shared --enable-libmp3lame --enable-libvorbis --enable-libtheora --enable-libfaac --enable-libvpx > ~/avconv_configuration.txt

Note the settings we have passed to the build configuration – the most important for audio transcoding are --enable-libmp3lame and --enable-libvorbis.
libmp3lame allows us to transcode to mp3 and libvorbis allows us to transcode to ogg, which is required for the Firefox HTML5 player.

Now we’ll build avconv, which takes ages. I’ve added the switch -j14 to run multiple jobs during make. You may or may not have better luck with different values after the -j depending on your own CPU architecture. 14 was best for me with my dual-core hyperthreaded 8GB RAM machine.

sudo make -j14
sudo checkinstall

Hit enter to select y to create the default docs.
Enter avconv when requested as the name for the installed program package.
Hit enter again and then again to start the process.

sudo ldconfig
sudo /etc/init.d/apache2 restart

Set up API/RPC access for external client program access

Next, in order for your iPhone, Android phone, or pimpache device to access the music, we need to set up ACLs for API/RPC and streaming access. NB: pimpache is not a typo but a “Raspberry Pi Ampache client” project. It’s currently under construction and located on GitHub under “PiPro”.

Go to a browser and login to ampache.

First we’ll create an external user. In the web interface click the gray “admin” filing cabinet on the left.
Click the “add user” link.
Create a username such as <external user>.
Give it a GOOD password <external user password> but also remember that you may well be entering the password for it on your phone.
Set user access to “user”.
Click “add user”.
Click continue.

You can create users for different situations as well as different people. For example I have a “work” user with a lower default transcoding bit rate and a home user with a very high transcoding default bit rate.
Set the transcoding bitrate for a given user by going to the admin section (gray filing cabinet) then clicking “browse users” then the “preferences” icon.
The default transcoding bitrate is then set in the “transcode bitrate” section.
Logged-in users may also set their own default transcode bit rate by clicking the “preferences” icon in the menu tab, then streaming.

Now we’ll add the API/RPC so the external user can get at the catalog.

In the web interface click the gray “admin” filing cabinet on the left.
Click the “show acls” link.
Now click the “add API/RPC host” link.
Give it an appropriate name such as “external” (no quotes).

  • level “read/write”
  • user external
  • acl type API/RPC
  • start 1.1.1.1 end 255.255.255.255

Click create ACL.
Click continue.
Now click the “add API/RPC host” link AGAIN.
Give it an appropriate name such as “external” (no quotes).

  • level “read/write”
  • user external
  • acl type API/RPC + Stream access
  • start 1.1.1.1 end 255.255.255.255

Click create ACL.
You may get the message “DUPLICATE ACL DEFINED”.
That’s ok.
Now click show ACLs and you should see two listed called external with the settings above – one for API/RPC and one for streaming.
You may see 4; that’s OK as long as external (or all) can get streaming and API access.

Now you can install and use many and varied clients to attach to your ampache server and start enjoying its musical goodness.
NB: Phones, Viridian clients, etc. will ask for a username, password, and Ampache URL. Above, you set up an external username and password to use in just such a situation, and the URL is the URL to your server from the outside world + /ampache.
eg: http://mypersonalampachesite.no-ip.biz/ampache

If you want to use an internal network address you may need to specify the ip address rather than the server host name depending on your router DNS system.

Catalog  setup

Finally after all that your music will hopefully have finished copying over and you can create your ampache music catalog.

Click on the gray “admin” filing cabinet in your ampache website.
Click “add a catalog”.
Enter a simple name for your catalog eg “home”.
Enter the path to the music folder you’ve been copying your files across to, i.e. if you created a shared folder and copied your music to it as described above, the path is: “/home/<ubuntu username>/music”
(watch out for capital letters here, swap out the <ubuntu username> for the username you used, and don’t use the quotes ” ).
Set the “Level catalog type” as local but note that you could potentially chain other catalogs to yours – how cool is that.
If you keep your music organised like I do then leave the filename and folder patterns as is.
Click to “gather album art” and “build playlists” then click “add catalog”.

Final security tidyup

In the PuTTY terminal, we’ll restrict permissions on the extracted Ampache folder to protect it from malicious software/people.

sudo chown -R www-data:www-data /usr/local/src/www/ampache-3.6-alpha6
sudo chmod -R 0700 /usr/local/src/www/ampache-3.6-alpha6/

That should do it.
If you need to move around in that directory again for some reason you will need to make the permissions more relaxed.
You can do this with

sudo chmod -R 0777 /usr/local/src/www/ampache-3.6-alpha6/

Don’t forget to do

sudo chmod -R 0700 /usr/local/src/www/ampache-3.6-alpha6/

after to tighten it up again.

Now, go and amaze your friends and family by streaming YOUR music to THEIR PC or phone.

Troubleshooting

The best help I can offer is to look inside the log file. If you followed the how-to above, you can find the log files in /var/log/ampache/.

cd /var/log/ampache
dir

Note the latest file name then access the log file with (for example):

 sudo nano /var/log/ampache/yyyymmdd.ampache.log

Please, if you have any advice re: transcoding commands feel free to leave helpful comments. I think help on this is really hard to come by.

For starters the best info I can find for avconv in general is at http://libav.org/avconv.html.

If you get permission errors when trying to copy to the music folder try again to relax the permissions on this folder with

sudo chmod 777 ~/music

Messed it up and want to start again from scratch?

Instead of reinstalling Ubuntu, LAMP, and SAMBA, you can delete Ampache and its database with:

sudo mysql -u root -p

Enter your mysql password:

drop database ampache;

quit

sudo rm -R /usr/local/src/www/ampache-3.6-alpha6
cd ~/downloads
sudo tar zxvf ampache.tar.gz -C /usr/local/src/www
sudo chmod -R 7777 /usr/local/src/www/ampache-3.6-alpha6  

That should get you reset with all the default settings and ready to try again from the initial web logon.

(source: Howtoforge.com)

Sean Hull’s 20 Biggest Bottlenecks That Reduce And Slow Down Scalability

This article is a lightly edited version of 20 Obstacles to Scalability by Sean Hull (with permission) from the always excellent and thought provoking ACM Queue.

1. TWO-PHASE COMMIT

Normally when data is changed in a database, it is written both to memory and to disk. When a commit happens, a relational database makes a commitment to freeze the data somewhere on real storage media. Remember, memory doesn’t survive a crash or reboot. Even if the data is cached in memory, the database still has to write it to disk. MySQL binary logs or Oracle redo logs fit the bill.

With a MySQL cluster or distributed file system such as DRBD (Distributed Replicated Block Device) or Amazon Multi-AZ (Multi-Availability Zone), a commit occurs not only locally, but also at the remote end. A two-phase commit means waiting for an acknowledgment from the far end. Because of network and other latency, those commits can be slowed down by milliseconds, as though all the cars on a highway were slowed down by heavy loads. For those considering using Multi-AZ or read replicas, the Amazon RDS (Relational Database Service) use-case comparison at http://www.iheavy.com/2012/06/14/rds-or-mysql-ten-use-cases/ will be helpful.

Synchronous replication has these issues as well; hence, MySQL’s solution is semi-synchronous, which makes some compromises in a real two-phase commit.

2. INSUFFICIENT CACHING

Caching is very important at all layers, so where is the best place to cache: at the browser, the page, the object, or the database layer? Let’s work through each of these.

Browser caching might seem out of reach, until you realize that the browser takes directives from the Web server and the pages it renders. Therefore, if the objects contained therein have longer expire times, the browser will cache them and will not need to fetch them again. This is faster not only for the user, but also for the servers hosting the Web site, as returning visitors will place a lighter load on them.

More information about browser caching is available at http://www.iheavy.com/2011/11/01/5-tips-cache-websites-boost-speed/. Be sure to set expire headers and cache control.
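If your application happens to serve some assets through Java servlets, a minimal sketch of setting those headers might look like the following (the servlet class, the CSS content type, and the one-year lifetime are illustrative assumptions, not a recommendation for every asset):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet that serves a static asset with long-lived cache headers.
public class CachedAssetServlet extends HttpServlet {
    private static final long ONE_YEAR_SECONDS = 365L * 24 * 60 * 60;

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Tell browsers (and intermediate caches) they may keep this object for a year.
        resp.setHeader("Cache-Control", "public, max-age=" + ONE_YEAR_SECONDS);
        resp.setDateHeader("Expires", System.currentTimeMillis() + ONE_YEAR_SECONDS * 1000);
        resp.setContentType("text/css");
        resp.getWriter().write("/* static stylesheet body would go here */");
    }
}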

Page caching requires using a technology such as Varnish (https://www.varnish-cache.org/). Think of this as a mini Web server with high speed and low overhead. It can’t handle as complex pages as Apache can, but it can handle the very simple ones better. It therefore sits in front of Apache and reduces load, allowing Apache to handle the more complex pages. This is like a traffic cop letting the bikes go through an intersection before turning full attention to the more complex motorized vehicles.

Object caching is done by something like memcache. Think of it as a Post-it note for your application. Each database access first checks the object cache for current data and answers to its questions. If it finds the data it needs, it gets results 10 to 100 times faster, allowing it to construct the page faster and return everything to the user in the blink of an eye. If it doesn’t find the data it needs, or finds only part of it, then it will make database requests and put those results in memcache for later sessions to enjoy and benefit from.
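As a rough sketch of that check-the-cache-first pattern, here is what it might look like with the spymemcached Java client (the key scheme, the 10-minute expiry, and the database helper are invented for illustration):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class UserProfileCache {
    private final MemcachedClient memcache;

    public UserProfileCache() throws Exception {
        // Connect to a local memcached instance (the address is an assumption).
        memcache = new MemcachedClient(new InetSocketAddress("localhost", 11211));
    }

    public String getProfile(long userId) {
        String key = "user_profile:" + userId;            // hypothetical key scheme
        String cached = (String) memcache.get(key);       // check the object cache first
        if (cached != null) {
            return cached;                                // cache hit: skip the database
        }
        String fromDb = loadProfileFromDatabase(userId);  // cache miss: fall back to the DB
        memcache.set(key, 600, fromDb);                   // keep it for 10 minutes for later requests
        return fromDb;
    }

    private String loadProfileFromDatabase(long userId) {
        // Placeholder for the real database query.
        return "profile-" + userId;
    }
}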

3. SLOW DISK I/O, RAID 5, MULTITENANT STORAGE

Everything, everything, everything in a database is constrained by storage—not by the size or space of that storage but by how fast data can be written to those devices.

If you’re using physical servers, watch out for RAID 5, a type of RAID (redundant array of independent disks) that devotes the equivalent of one disk to parity for protection. That parity comes with a huge write penalty, which you must always carry. What’s more, if you lose a drive, these arrays are unusably slow during the rebuild.

The solution is to start with RAID 10, which gives you striping over mirrored sets. This results in no parity calculation and no penalty during a rebuild.

Cloud environments may work with technology such as Amazon EBS (Elastic Block Store), a virtualized disk similar to a storage area network. Since it’s network based, you must contend and compete with other tenants (aka customers) reading and writing to that storage. Further, those individual disk arrays can handle only so much reading and writing, so your neighbors will affect the response time of your Web site and application.

Recently Amazon rolled out a badly branded offering called Provisioned IOPS (I/O operations per second). That might sound like a great name to techies, but to everyone else it doesn’t mean anything noteworthy. It’s nonetheless important. It means you can lock in and guarantee the disk performance your database is thirsty for. If you’re running a database on Amazon, then definitely take a look at this.

4. SERIAL PROCESSING

When customers are waiting to check out in a grocery store with 10 cash registers open, that’s working in parallel. If every cashier is taking a lunch break and only one register is open, that’s serialization. Suddenly a huge line forms and snakes around the store, frustrating not only the customers checking out, but also those still shopping. It happens at bridge tollbooths when not enough lanes are open, or in sports arenas when everyone is leaving at the same time.

Web applications should definitely avoid serialization. Do you see requests backing up while waiting for API calls to return, or are all your Web nodes working off one search server? Anywhere your application forms a line, that’s serialization, and it should be avoided at all costs.
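As a hedged sketch of the alternative, independent API calls can be fanned out to a thread pool so the total wait approaches the slowest single call rather than the sum of all of them (the endpoint names and the fetchFromApi helper are placeholders):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelApiCalls {
    // Placeholder for a blocking call to some external API.
    private static String fetchFromApi(String endpoint) {
        return "response from " + endpoint;
    }

    public static void main(String[] args) throws Exception {
        List<String> endpoints = Arrays.asList("ratings", "comments", "recommendations");
        ExecutorService pool = Executors.newFixedThreadPool(endpoints.size());

        // Issue all calls at once instead of waiting for each one in turn.
        List<Future<String>> pending = new ArrayList<Future<String>>();
        for (final String endpoint : endpoints) {
            pending.add(pool.submit(new Callable<String>() {
                public String call() {
                    return fetchFromApi(endpoint);
                }
            }));
        }

        // Collect the results; wall-clock time is roughly the slowest call, not the sum.
        for (Future<String> result : pending) {
            System.out.println(result.get());
        }
        pool.shutdown();
    }
}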

5. MISSING FEATURE FLAGS

Developers normally build in features and functionality for business units and customers. Feature flags are operational necessities that allow those features to be turned on or off in either back-end config files or administration UI pages.

Why are they so important? If you have ever had to put out a fire at 4 a.m., then you understand the need for contingency plans. You must be able to disable ratings, comments, and other auxiliary features of an application, just so the whole thing doesn’t fall over. What’s more, as new features are rolled out, sometimes the kinks don’t show up until a horde of Internet users hit the site. Feature flags allow you to disable a few features, without taking the whole site offline.
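A minimal sketch of a config-backed feature flag, assuming a plain Java properties file (the file path and flag keys are invented for illustration):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class FeatureFlags {
    private final Properties flags = new Properties();

    public FeatureFlags(String path) throws IOException {
        // e.g. /etc/myapp/features.properties containing lines like "comments.enabled=true"
        FileInputStream in = new FileInputStream(path);
        try {
            flags.load(in);
        } finally {
            in.close();
        }
    }

    public boolean isEnabled(String feature) {
        // Default to "off" so a missing entry never switches a feature on by accident.
        return Boolean.parseBoolean(flags.getProperty(feature + ".enabled", "false"));
    }
}

In the application you would then guard each auxiliary feature, e.g. if (flags.isEnabled("comments")) { renderCommentForm(); }, so flipping the property and reloading the config disables comments without a redeploy.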

6. SINGLE COPY OF THE DATABASE

You should always have at least one read replica or MySQL slave online. This allows for faster recovery in the event that the master fails, even if you’re not using the slave for browsing—but you should do that, too, since you’re going to build a browse-only mode, right?

Having multiple copies of a database suggests horizontal scale. Once you have two, you’ll see how three or four could benefit your infrastructure.

7. USING YOUR DATABASE FOR QUEUING

A MySQL database server is great at storing tables of data and the relationships between them. Unfortunately, it’s not great at serving as a queue for an application. Despite this, a lot of developers fall into the habit of using a table for this purpose. For example, does your app have some sort of jobs table, or perhaps a status column, with values such as “in-process,” “in-queue,” and “finished”? If so, you’re inadvertently using tables as queues.

Such solutions run into scalability hang-ups because of locking challenges and the scan and poll process to find more work. They will typically slow down a database. Fortunately, some good open source solutions are available, such as RabbitMQ (http://www.rabbitmq.com/) or Amazon’s SQS (Simple Queue Service; http://aws.amazon.com/sqs/).
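For example, here is a sketch of publishing a job to a dedicated queue with the RabbitMQ Java client instead of inserting an “in-queue” row into a jobs table (the broker address, queue name, and message body are assumptions):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class JobProducer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                     // broker address is an assumption

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // Durable queue named "image-resize-jobs" (name invented for this example).
        channel.queueDeclare("image-resize-jobs", true, false, false, null);

        String job = "{\"photoId\": 42, \"size\": \"thumbnail\"}";
        channel.basicPublish("", "image-resize-jobs",
                MessageProperties.PERSISTENT_TEXT_PLAIN, job.getBytes("UTF-8"));

        channel.close();
        connection.close();
    }
}

A separate pool of workers then consumes from the queue, so the database never has to be scanned and polled to find more work.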

8. USING A DATABASE FOR FULL-TEXT SEARCHING

Page searching is another area where applications get caught. Although MySQL has had full-text indexes for some time, they have worked only with MyISAM tables, the legacy table type that is not crash-proof, not transactional, and just an all-around headache for developers.

One solution is to go with a dedicated search server such as Solr (http://lucene.apache.org/solr/). These servers have good libraries for whatever language you’re using and high-speed access to search. These nodes also scale well and won’t bog down your database.
Alternatively, Sphinx SE, a storage engine for MySQL, integrates the Sphinx server right into the database. If you’re looking to the horizon, full-text indexing is coming to InnoDB, MySQL’s default storage engine, in version 5.6 of MySQL.
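As a rough illustration, querying a dedicated Solr 4.x server from Java with the SolrJ client looks something like this (the core URL, field names, and query term are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ProductSearch {
    public static void main(String[] args) throws Exception {
        // URL of the Solr core is an assumption for this sketch.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/products");

        SolrQuery query = new SolrQuery("title:camera");  // hypothetical field and term
        query.setRows(10);

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id") + " : " + doc.getFieldValue("title"));
        }
    }
}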

9. OBJECT RELATIONAL MODELS

The ORM (object relational model), the bane of every Web company that has ever used it, is like cooking with MSG. Once you start using it, it’s hard to wean yourself off.

The plus side is that ORMs help with rapid prototyping and allow developers who aren’t SQL masters to read and write to the database as objects or memory structures. They are faster, cleaner, and offer quicker delivery of functionality—until you roll out on servers and want to scale.

Then your DBA (database administrator) will come to the team with a very slow-running, very ugly query and say, “Where is this in the application? We need to fix it. It needs to be rewritten.” Your dev team then says, “We don’t know!” And an incredulous look is returned by the ops team.

The ability to track down bad SQL and root it out is essential. Bad SQL will happen, and your DBA team will need to index properly. If queries are coming from ORMs, they don’t lend themselves to all of this. Then you’re faced with a huge technical debt and the challenge of ripping and replacing.

10. MISSING INSTRUMENTATION

Instrumentation provides speedometers and fuel needles for Web applications. You wouldn’t drive a car without them, would you? They expose information about an application’s internal workings. They record timings and provide feedback about where an application spends most of its time.

One very popular Web services solution is New Relic (http://newrelic.com/), which provides visual dashboards that appeal to everyone—project managers, developers, the operations team, and even business units can all peer at the graphs and see what’s happening.

Some open source instrumentation projects are also available.
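Even a homegrown sketch is better than flying blind; the following is a minimal, hypothetical timing helper (not any particular product’s API) that records how long named operations take so you can see where requests spend their time:

import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class SimpleInstrumentation {
    private static final Map<String, AtomicLong> totalNanos = new ConcurrentHashMap<String, AtomicLong>();
    private static final Map<String, AtomicLong> callCounts = new ConcurrentHashMap<String, AtomicLong>();

    // Wrap any named operation and record how long it took.
    public static <T> T timed(String name, Callable<T> work) throws Exception {
        long start = System.nanoTime();
        try {
            return work.call();
        } finally {
            record(totalNanos, name, System.nanoTime() - start);
            record(callCounts, name, 1);
        }
    }

    private static void record(Map<String, AtomicLong> map, String name, long delta) {
        AtomicLong counter = map.get(name);
        if (counter == null) {
            AtomicLong fresh = new AtomicLong();
            AtomicLong existing = map.putIfAbsent(name, fresh);
            counter = (existing != null) ? existing : fresh;
        }
        counter.addAndGet(delta);
    }

    // Dump average timings, e.g. from a periodic task or an admin page.
    public static void report() {
        for (Map.Entry<String, AtomicLong> entry : callCounts.entrySet()) {
            AtomicLong total = totalNanos.get(entry.getKey());
            long calls = entry.getValue().get();
            if (total == null || calls == 0) {
                continue;  // entry is still being recorded
            }
            long avgMicros = total.get() / calls / 1000;
            System.out.println(entry.getKey() + ": " + calls + " calls, avg " + avgMicros + " us");
        }
    }
}

Calls are wrapped like SimpleInstrumentation.timed("render_home", ...), and report() can be logged periodically or exposed on an admin page.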

11. LACK OF A CODE REPOSITORY AND VERSION CONTROL

Though it’s rare these days, some Internet companies do still try to build software without version control. Those who use it, however, know the everyday advantage and organizational control it provides for a team.

If you’re not using it, you are going to spiral into technical debt as your application becomes more complex. It won’t be possible to add more developers and work on different parts of your architecture and scaffolding.

Once you start using version control, be sure to get all the components in there, including configuration files and other essentials. Missing pieces that have to be located and tracked down at deployment time become an additional risk.

12. SINGLE POINTS OF FAILURE

If your data is on a single master database, that’s a single point of failure. If your server is sitting on a single disk, that’s a single point of failure. This is just technical vernacular for an Achilles heel.

These single points of failure must be rooted out at all costs. The trouble is recognizing them. Even relying on a single cloud provider can be a single point of failure. Amazon’s data center or zone failures are a case in point. If it had had multiple providers or used Amazon differently, Airbnb would not have experienced downtime when part of Amazon Web Services went down in October 2012 (http://www.iheavy.com/2012/10/23/airbnb-didnt-have-to-fail/).

13. LACK OF BROWSE-ONLY MODE

If you’ve ever tried to post a comment on Yelp, Facebook, or Tumblr late at night, you’ve probably gotten a message to the effect of, “This feature isn’t available. Try again later.” “Later” might be five minutes or 60 minutes. Or maybe you’re trying to book airline tickets and you have to retry a few times. To nontechnical users, the site still appears to be working normally, but it just has this strange blip.

What’s happening here is that the application is allowing you to browse the site, but not make any changes. It means the master database or some storage component is offline.

Browse-only mode is implemented by keeping multiple read-only copies of the master database, using a method such as MySQL replication or Amazon read replicas. Since the application will run almost fully in browse mode, it can hit those databases without the need for the master database. This is a big, big win.
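One rough way to express this in application code is a small router that always serves reads from a replica and refuses writes while a browse-only flag is set. The sketch below assumes two pre-configured DataSources and leaves out how the flag gets flipped (health check, ops switch, etc.):

import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

public class BrowseOnlyRouter {
    private final DataSource master;
    private final DataSource readReplica;
    private volatile boolean browseOnly = false;   // flipped by ops or a health check

    public BrowseOnlyRouter(DataSource master, DataSource readReplica) {
        this.master = master;
        this.readReplica = readReplica;
    }

    public void setBrowseOnly(boolean browseOnly) {
        this.browseOnly = browseOnly;
    }

    public Connection connectionForRead() throws SQLException {
        // Reads keep working: serve them from a read replica.
        return readReplica.getConnection();
    }

    public Connection connectionForWrite() throws SQLException {
        if (browseOnly) {
            // The caller turns this into a friendly "try again later" message.
            throw new SQLException("Site is in browse-only mode; writes are temporarily disabled");
        }
        return master.getConnection();
    }
}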

14. WEAK COMMUNICATION

Communication may seem a strange place to take a discussion on scalability, but the technical layers of Web applications cannot be separated from the social and cultural ones that the team navigates.

Strong lines of communication are necessary, and team members must know whom to go to when they’re in trouble. Good communication demands confident and knowledgeable leadership, with the openness to listen and improve.

15. LACK OF DOCUMENTATION

Documentation happens at a lot of layers in a Web application. Developers need to document procedures, functions, and pages to provide hints and insight to future generations looking at that code. Operations teams need to add comments to config files to provide change history and insight when things break. Business processes and relationships can and should be documented in company wikis to help people find their own solutions to problems.

Documentation helps at all levels and is a habit everyone should embrace.

16. LACK OF FIRE DRILLS

Fire drills always get pushed to the backburner. Teams may say, “We have our backups; we’re covered.” True, until they try to restore those backups and find that they’re incomplete, missing some config file or crucial piece of data. If that happens when you’re fighting a real fire, then something you don’t want will be hitting that office fan.

Fire drills allow a team to run through the motions, before they really need to. Your company should task part of its ops team with restoring its entire application a few times a year. With AWS and cloud servers, this is easier than it once was. It’s a good idea to spin up servers just to prove that all your components are being backed up. In the process you will learn how long it takes, where the difficult steps lie, and what to look out for.

17. INSUFFICIENT MONITORING AND METRICS

Monitoring falls into the same category of best practices as version control: it should be so basic you can’t imagine working without it; yet there are Web shops that go without, or with insufficient monitoring—some server or key component is left out.

Collecting this data over time, for system- and server-level metrics as well as application- and business-level availability, is equally important. If you don’t want to roll your own, consider a Web services solution to provide your business with real uptime.

18. COWBOY OPERATIONS

You roll into town on a fast horse, walk into the saloon with guns blazing, and you think you’re going to make friends? Nope, you’re only going to scare everyone into complying but with no real loyalty. That’s because you’ll probably break things as often as you fix them. Confidence is great, but it’s best to work with teams. The intelligence of the team is greater than any of the individuals.

Teams need to communicate what they’re changing, do so in a managed way, plan for any outage, and so forth. Caution and risk aversion win the day. Always have a plan B. You should be able to undo the change you just made, and be aware which commands are destructive and which ones cannot be undone.

19. GROWING TECHNICAL DEBT

As an app evolves over the years, the team may spend more and more time maintaining and supporting old code, squashing bugs, or ironing out kinks. Therefore, they have less time to devote to new features. This balance of time devoted to debt servicing versus real new features must be managed closely. If you find your technical debt increasing, it may be time to bite the bullet and rewrite. Rewriting will take time away from the immediate business benefit of new functionality and customer features, but it is best for the long term.

Technical debt isn’t always easy to recognize or focus on. As you’re building features or squashing bugs, you’re more attuned to details at the five-foot level. It’s easy to miss the forest for the trees. That’s why generalists are better at scaling the Web (http://www.iheavy.com/2011/10/25/why-generalists-better-scaling-web/).

20. INSUFFICIENT LOGGING

Logging is closely related to metrics and monitoring. You may enable a lot more of it when you’re troubleshooting and debugging, but on an ongoing basis you’ll need it for key essential services. Server syslogs, Apache and MySQL logs, caching logs, etc., should all be working. You can always dial down logging if you’re getting too much of it, or trim and rotate log files, discarding the old ones.

(source: HighScalability.com)

Paper: MegaPipe: A New Programming Interface For Scalable Network I/O

The paper MegaPipe: A New Programming Interface for Scalable Network I/O (video, slides) hits the common theme that if you want to go faster you need a better car design, not just a better driver. So that’s why the authors started with a clean slate and designed a network API from the ground up with support for concurrent I/O, a requirement for achieving high performance while scaling to large numbers of connections per thread, multiple cores, etc.  What they created is MegaPipe, “a new network programming API for message-oriented workloads to avoid the performance issues of BSD Socket API.”

The result: MegaPipe outperforms baseline Linux between 29% (for long connections) and 582% (for short connections). MegaPipe improves the performance of a modified version of memcached between 15% and 320%. For a workload based on real-world HTTP traces, MegaPipe boosts the throughput of nginx by 75%.

What’s this most excellent and interesting paper about?

Message-oriented network workloads, where connections are short and/or message sizes are small, are CPU intensive and scale poorly on multi-core systems with the BSD Socket API. We present MegaPipe, a new API for efficient, scalable network I/O for message-oriented workloads. The design of MegaPipe centers around the abstraction of a channel – a per-core, bidirectional pipe between the kernel and user space, used to exchange both I/O requests and event notifications. On top of the channel abstraction, we introduce three key concepts of MegaPipe: partitioning, lightweight socket (lwsocket), and batching.

We implement MegaPipe in Linux and adapt memcached and nginx. Our results show that, by embracing a clean-slate design approach, MegaPipe is able to exploit new opportunities for improved performance and ease of programmability. In microbenchmarks on an 8-core server with 64 B messages, MegaPipe outperforms baseline Linux between 29% (for long connections) and 582% (for short connections). MegaPipe improves the performance of a modified version of memcached between 15% and 320%. For a workload based on real-world HTTP traces, MegaPipe boosts the throughput of nginx by 75%.

Performance with Small Messages:

Small messages result in greater relative network I/O overhead in comparison to larger messages. In fact, the per-message overhead remains roughly constant and thus, independent of message size; in comparison with a 64 B message, a 1 KiB message adds only about 2% overhead due to the copying between user and kernel on our system, despite the large size difference.

Partitioned listening sockets:

Instead of a single listening socket shared across cores, MegaPipe allows applications to clone a listening socket and partition its associated queue across cores. Such partitioning improves performance with multiple cores while giving applications control over their use of parallelism.

Lightweight sockets:

Sockets are represented by file descriptors and hence inherit some unnecessary file-related overheads. MegaPipe instead introduces lwsocket, a lightweight socket abstraction that is not wrapped in file-related data structures and thus is free from system-wide synchronization.

System Call Batching:

MegaPipe amortizes system call overheads by batching asynchronous I/O requests and completion notifications within a channel.

(Source: HighScalability.com)

BloomJoin: BloomFilter + CoGroup

We recently open-sourced a number of internal tools we’ve built to help our engineers write high-performance Cascading code as the cascading_ext project. Today I’m going to talk about a tool we use to improve the performance of asymmetric joins—joins where one data set in the join contains significantly more records than the other, or where many of the records in the larger set don’t share a common key with the smaller set.

Asymmetric Joins

A common MapReduce use case for us is joining a large dataset with a global set of records against a smaller one—for example, we have a store with billions of transactions keyed by user ID, and want to find all transactions by users who were seen within the last 24 hours.   The standard way to execute this join with map-reduce is to sort each store by user ID, and then use a reducer function which skips transactions unless the user ID is present on both left and right-hand sides of the join (here the global transaction store is the left hand side and the specific set of user IDs is the right hand side):

 (when writing a Cascading job, this is just a CoGroup with an InnerJoin)

This solution works perfectly well, but from a performance standpoint, there’s a big problem: we pointlessly sorted the entire store of billions of transactions, even though only a few million of them made it into the output store! What if we could get rid of irrelevant transactions, so we only bother sorting the ones which will end up in the output?

It turns out we can do this, with the use of a clever and widely-used data-structure: the Bloom Filter.

Bloom Filters

A bloom filter is a compact, probabilistic representation of a set of objects.  Given an arbitrary object, it is cheap to test whether that object is a member of the original set; a property of bloom filters is that while an object that should be in the set is never falsely rejected, it is possible for an object which should not be in the set to be accepted anyway.  That is to say, the filter does not allow false negatives, but can allow false positives.

For illustration, say we built a bloom filter out of records {A, E}, and want to test values {A, B, C, D} for membership:

Bloom filter simple

Say values C and D were correctly rejected – these would be true negatives.  Key A will always be accepted since it was in the original filter, so we call it a true positive.  Key B, however, we call a false positive, since it was accepted even though it was not in the original set of keys.

A bloom filter works by hashing the keys that are inserted into the filter several times and marking the corresponding slots in a bit array.  When testing objects for membership in the filter, the same hash functions are applied to the items being tested.  The corresponding bits are then checked in the array:

As seen above, there are three possibilities when those bits are checked:

  • The item was actually in the original set and is accepted (object A)
  • The item was not in the original set, the corresponding bits are not set in the array, and it is rejected (object C)
  • The item was not in the original set, but the corresponding bits were set anyway, and it is accepted (object B)

Without going deep into the math behind bloom filters, there are four variables to consider when building one: the number of keys, the false-positive rate, the size of the filter, and the number of hash functions used.  For our MapReduce use case, “size of filter” is a fixed quantity; we’re going to build a bloom filter as large as we can fit in the memory of a map task.  We also have a fixed number of keys to insert in the filter (for a particular join, we’re building a bloom filter out of all keys on one side of the join).

Given a fixed number of keys and bits, we can choose a number of hash functions which minimizes our false positive rate.  For more details, check out the wiki article.
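As a concrete, standalone illustration (this uses Google Guava’s BloomFilter class rather than the cascading_ext implementation, and the sizes are arbitrary), note that the library picks the number of hash functions for you once you fix the expected number of keys and the target false-positive rate:

import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomFilterDemo {
    public static void main(String[] args) {
        // Size the filter for ~1 million keys with a 1% target false-positive rate;
        // Guava chooses the optimal number of hash functions from those two inputs.
        BloomFilter<CharSequence> seenKeys = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1000000, 0.01);

        seenKeys.put("A");
        seenKeys.put("E");

        // Never a false negative: "A" is always accepted.
        System.out.println(seenKeys.mightContain("A"));  // true
        // Usually rejected, but may occasionally come back true as a false positive.
        System.out.println(seenKeys.mightContain("C"));  // almost certainly false
    }
}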

Putting it together

So how can a bloom filter improve the performance of our map-reduce job?   When joining two stores, before performing the reduce itself, we can use the bloom filter to filter out most of the records which would not have matched anything in the other side of the join:

Unlike the previous time, when we had to sort all the left hand side records before joining with the right side, we only have to sort the true positives {A}, as well as the false positives {B}.

BloomJoin

We’ve put all of this together into our open-source cascading_ext library on Github, as the sub-assemblies BloomJoin and BloomFilter.  The BloomJoin assembly closely mimics the CoGroup interface, but behind the scenes, we first use a bloom filter to filter the left-hand side pipe before performing a CoGroup.  If you use the CascadingHelper utility in the cascading_ext library, the assembly will work on its own:

Pipe source1 = new Pipe("source1");
Pipe source2 = new Pipe("source2");

Pipe joined = new BloomJoin(source1, new Fields("field1"),
  source2, new Fields("field3"));

CascadingUtil.get().getFlowConnector()
  .connect("Example flow", sources, sink, joined)
  .complete();

If using a raw HadoopFlowConnector, a FlowStepStrategy along with some properties will need to be attached to the flow:

Pipe source1 = new Pipe("source1");
Pipe source2 = new Pipe("source2");

Pipe joined = new BloomJoin(source1, new Fields("field1"),
  source2, new Fields("field3"));

Flow f = new HadoopFlowConnector(BloomProps.getDefaultProperties())
  .connect("Example flow", sources, sink, joined);
f.setFlowStepStrategy(new BloomAssemblyStrategy());
f.complete();

The classes BloomJoinExample and BloomJoinExampleWithoutCascadingUtil have some more examples which should work on your cluster out-of-the-box.

Performance

We measured the performance benefits of BloomJoin by running a benchmark job on our own data. For a sample dataset we took all request logs for one of our services, where each log entry contains a unique user ID. Each user ID may have many log entries in a single day, and is not necessarily seen every day. The job we ran took all logs from a particular day (2012/12/26), and found all log entries from the previous two months associated with each of the user IDs from that day:

Logs in Full dataset (2 months of logs): 5,531,090,779
Logs from 12/26: 134,035,584
Logs from full dataset matching users from 12/26: 1,206,247,339 (21.8%)

We implemented this join two ways: first with a standard CoGroup using an InnerJoin, and second with our BloomFilter assembly:

Metric | CoGroup | Bloom Join | % of original cost
Map output records | 5,531,090,779 | 1,554,714,936 | 28.1
Map output bytes | 3,145,258,140,017 | 788,744,906,949 | 25.1
Reduce shuffle bytes | 806,870,037,807 | 177,418,598,374 | 22.0
Total time spent by all maps in occupied slots (ms) | 620,127,159 | 288,005,428 | 46.4
Total time spent by all reduces in occupied slots (ms) | 351,864,211 | 218,623,410 | 62.1
CPU time spent (ms) | 1,069,814,490 | 527,423,620 | 49.3

We can see that filtering the map output for relevant records cuts our map output by 75% (for this input, even an optimal filter would only cut our  output by 78%), and correspondingly cuts shuffle bytes and CPU time.  Note that reduce time was only cut by 38% since copying data from mappers to reducers has some fixed costs per reduce task, and both jobs ran with the same number of reduces on this test.

Overall, using a bloom filter cut the cost of the job on our Hadoop cluster by around 50% – a huge win.

Parameters to Tweak

There are a number of parameters available to tweak, depending on the kind of data being processed and the machines doing the processing.   The defaults should generally work fine, except for NUM_BLOOM_BITS, which will vary depending on the resources available to TaskTrackers on your cluster:

  • BloomProps.NUM_BLOOM_BITS:  The size of the bloom filter in-memory.  The parameter defaults to 300MB, but can be adjusted up or down depending on the free space on your map tasks.
  • BloomProps.NUM_SPLITS:  The number of parts to split the bloom filter into.   This determines the number of reduce tasks used when constructing the filter.

(via blog.liveramp.com)

Batoo JPA – The New JPA Implementation That Runs Over 15 Times Faster…

I loved JPA 1.0 back in the early 2000s. I started using it together with EJB 3.0 even before the stable releases. I loved it so much that I contributed bits and pieces to the JBoss 3.x implementations.

In those days our company was still fairly small. Creating new features and applications was a higher priority than performance, because we had a lot of ideas and needed to develop and market them as fast as we could. Now we no longer needed to write tedious and error-prone XML descriptions for the data model and deployment descriptors, nor did we need to use the curse called “XDoclet”.

On the other side, our company grew steadily, and our web site became the top portal in the country for live events and ticketing. Now we had performance problems! Although the company grew considerably, due to the economics of the industry we did not make a lot of money. The challenge was that we were a ticketing company. Every e-commerce business has high and low seasons, but for ticketing there are low seasons and high hours: while you sell an average of x tickets an hour, when a blockbuster event goes on sale, demand suddenly becomes thousands of x for an hour. Welcome to hell!

We worked day and night to tweak and enhance the application, using whatever was available, to keep it up on a big day. To be frank, there was always a bigger event capable of bringing the site down, no matter how hard we tried.

The dream was over; I came to realize that developing applications on top of frameworks is as much “be careful!” as it is “fun”.

I Kept Learning

I loved programming, I loved Java, I loved open source. I developed almost every possible type of application on every possible platform I could. For the rest, I dug in and discovered things for myself. Thanks to open source, I learned a lot from the masters: unlike most, I read articles and code written by great programmers like Linus Torvalds, Gavin King, Ed Merks, and so many others.

With the experience I had gathered, I quit the ticketing company I loved and became a software consultant. This opened a new era for me: there were a lot of different industries and a lot of different platforms to work with.

In each project I became the performance police of the application.

I am now the performance freak!

I Took The Red Pill!

One day I said to myself: could JPA be faster? If yes, how fast could it be? I spent about two weeks creating an EntityManager that persisted and loaded entities. Then I ran it and compared the results to those of Hibernate. The results were not really promising: I was only about 50% faster than Hibernate in persisting and finding entities. I spent another week tweaking the loops, caching metamodel chunks, changing access to classes from interfaces to abstract classes, converting lists to arrays, and so many other things. Suddenly I had a prototype that was 50+ times faster than Hibernate!

Development Of Batoo JPA

I was astonished by how drastically performance went up just by paying attention to performance-centric coding. By then I was using VisualVM to measure the time spent in the JPA layer. I sat down and wrote a self-profiling tool that measured the CPU resources spent at the JPA layer and started implementing every aspect of the JPA 2.0 specification. After each iteration I re-ran the benchmark, and when performance dropped considerably I went back to the changes and inspected the new code line by line – the profiling tool I created reported the performance hit of every line of the JPA stack.

It took about six months to implement the specification as a whole. On top of it, I introduced a Maven plugin to perform bytecode instrumentation at build time and a complementary Eclipse plugin to allow the use of instrumentation in the Eclipse IDE.

After a six-month gestation, Batoo JPA was born in August 2012. It measured over 15 times faster than Hibernate.

Benchmark

As stated earlier, a benchmark was introduced to measure every micro development iteration of Batoo JPA. This benchmark was not created to showcase the areas where Batoo JPA was fast so that others would believe in it, but to put together the most common domain model and persistence operations that exist in almost every JPA application – so that I knew how fast Batoo JPA really was.

Performance Metrics

The scenario is:

  • A Person object
    • With phone numbers – PhoneNumber object
    • With addresses – Address object
      • That point to country – Country Object

Common life-cycle tasks have been introduced (a minimal sketch of these operations follows the list):

  • Persist 100K person objects with two phone numbers and two addresses in lots of 10 per session
  • Locate and load 250K person objects with lots of 10 per session
  • Remove 5K person objects with lots of 5 per session
  • Update 100K person objects with lots of 100
  • Query person objects 25K times using Object Oriented Criteria Querying API.
  • Query person objects 25K times using JPQL – Java Persistence Query Language, an SQL-like query scripting language.
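To make those operations concrete, here is a minimal, hypothetical sketch of the persist, JPQL, and Criteria steps in plain JPA 2.0; the persistence-unit name and the simplified Person entity are assumptions for illustration, not the actual benchmark code:

import java.util.List;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Persistence;
import javax.persistence.TypedQuery;
import javax.persistence.criteria.CriteriaBuilder;
import javax.persistence.criteria.CriteriaQuery;
import javax.persistence.criteria.Root;

@Entity
class Person {
    @Id @GeneratedValue
    private Long id;
    private String name;

    protected Person() { }                    // JPA requires a no-arg constructor
    Person(String name) { this.name = name; }
}

public class BenchmarkSketch {
    public static void main(String[] args) {
        // "benchmarkPU" is a hypothetical persistence unit defined in persistence.xml.
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("benchmarkPU");
        EntityManager em = emf.createEntityManager();

        // Persist (the real benchmark batches several entities per session).
        em.getTransaction().begin();
        em.persist(new Person("Jane Doe"));
        em.getTransaction().commit();

        // JPQL query.
        TypedQuery<Person> jpql =
                em.createQuery("SELECT p FROM Person p WHERE p.name = :name", Person.class);
        List<Person> byJpql = jpql.setParameter("name", "Jane Doe").getResultList();

        // Equivalent object-oriented Criteria query.
        CriteriaBuilder cb = em.getCriteriaBuilder();
        CriteriaQuery<Person> cq = cb.createQuery(Person.class);
        Root<Person> root = cq.from(Person.class);
        cq.select(root).where(cb.equal(root.get("name"), "Jane Doe"));
        List<Person> byCriteria = em.createQuery(cq).getResultList();

        System.out.println(byJpql.size() + " / " + byCriteria.size());
        em.close();
        emf.close();
    }
}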

For the sake of simplicity, the benchmark was run on top of in-memory embedded Derby with the profiler slicing the times spent at the

  • Unit Test Layer
  • JPA Layer
  • Derby Layer

The time spent at the Unit Test Layer is omitted from the results because it is not relevant here.

Results

The times given in the tables below are in milliseconds spent in the JPA layer while running the benchmark scenario. The same tests were run for Batoo JPA and Hibernate in separate runs to isolate boot, memory, cache, garbage collection, etc. effects.

The tables below show

  • the total time spent at Derby Layer as DB Operation Total
  • the type of the test as Test
  • the times for each test at Derby Layer as DB Operation
  • the times for each test at JPA Layer as Core Operation
  • the total time spent at JPA Layer as Core Operation Total
  • the total time spent at both JPA and Derby Layers as Operation Total


Below are the ratios of CPU resources spent by Hibernate and Batoo JPA. It is assumed that an application generates, on average, 1 save, 5 locate, 2 remove, and 3 update operations, plus 5 + 5 = 10 queries, in those ratios. Although these numbers are extremely dependent on the nature of the application, some sort of assumption is needed to make an overall speed comparison.


Given the scenario above, Batoo JPA measures over 15 times faster than Hibernate – the leading JPA implementation.

As you may have noticed, Batoo JPA not only performs insanely fast at the JPA layer, it also employs a number of optimizations to relieve the pressure on the database. This is why Batoo JPA measures half the time at the DB layer in comparison with Hibernate.

Interpretation of Results

We do appreciate that JPA is not the only part of an application. But we do believe that current JPA implementations consume quite a bit of your server budget. While a typical application cluster spends about 20% to 40% of its CPU resources on the persistence layer, Batoo JPA may well be able to bring your cluster down to half its size, allowing you to save a lot on licensing, administration, and hardware, as well as giving you room to scale up even for non-cluster-friendly applications – in my experience I have seen applications running on 96-core Solaris systems simply because they were not scalable.

Conclusion

We have managed to create a JPA product that lets you enjoy the great features of JPA technology without requiring you to compromise on performance!

On top of that, Batoo JPA is developed using the Apache coding standards and has valuable documentation within the code. The project codebase is released under the LGPL license, there is absolutely no closed-source part, and we envision that it will stay that way forever.

As stated earlier, it also has a complementary Maven and Eclipse plugin to provide instrumentation for build and development phases.

Batoo JPA deviates from the specification almost not at all, making it easy for existing JPA applications to be migrated to Batoo JPA while requiring no additional learning phase to start using it.

Last but not least, Batoo JPA saves you not only when you run your application, but also while you deploy it. Batoo JPA employs parallel deployer managers to handle deployment in parallel. Considering that a developer deploys the application easily 10 times a day during development, if not 100, with a moderately large domain model this adds up to quite a bit of developer time. Although we haven’t made a concrete benchmark of deployment speed, we know that Batoo JPA deploys about 3 to 4 times faster than Hibernate.

(via HighScalability.com)