Netflix open sources its data traffic cop, Suro

Netflix is open sourcing a tool called Suro that the company uses to direct data from its source to its destination in real time. More than just serving a key role in the Netflix data pipeline, though, it’s also a great example of the impressive — if sometimes redundant — ecosystem of open source data-analysis tools making their way out of large web properties.

Netflix’s various applications generate tens of billions of events per day, and Suro collects them all before sending them on their way. Most head to Hadoop (via Amazon S3) for batch processing, while others head to Druid and ElasticSearch (via Apache Kafka) for real-time analysis. According to the Netflix blog post explaining Suro (which goes into much more depth), the company is also looking at how it might use real-time processing engines such as Storm or Samza to perform machine learning on event data.

An example Suro workflow. Source: Netflix

To anyone familiar with the big data space, the names of the various technologies at play here are probably associated to some degree with the companies that developed them. Netflix created Suro, LinkedIn created Kafka and Samza, Twitter (actually, Backtype, which Twitter acquired) created Storm, and Metamarkets (see disclosure) created Druid. Suro, the blog post’s authors acknowledged, is based on the Apache Chukwa project and is similar to Apache Flume (created by Hadoop vendor Cloudera) and Facebook’s Scribe. Hadoop, of course, was all but created at Yahoo and has since seen notable ecosystem contributions from all sorts of web companies.

I sometimes wonder whether all these companies really need to be creating their own technologies all the time, or if they could often get by using the stuff their peers have already created. Like most things in life, though, the answer to that question is probably best decided on a case-by-case basis. Storm, for example, is becoming a very popular tool for stream processing, but LinkedIn felt like it needed something different and thus built Samza. Netflix decided it needed Suro rather than a pre-existing technology (largely because its cloud-heavy infrastructure runs in Amazon Web Services), but it also clearly uses lots of tools built elsewhere (including the Apache Cassandra database).

A diagram of LinkedIn’s data architecture as of February 2013, including everything from Kafka to Teradata.

Hopefully, the big winners in all this innovation will be mainstream technology users that don’t have the in-house talent (or, necessarily, the need) to develop these advanced systems themselves but could benefit from their capabilities. We’re already seeing Hadoop vendors, for example, trying to make projects such as Storm and the Spark processing framework usable by their enterprise customers, and it seems unlikely they’ll be the last. There are a whole lot of AWS users, after all, and they might want the capabilities Suro can provide without having to rely on Amazon to deliver them. (We’ll likely hear a lot more about the future of Hadoop as a data platform at our Structure Data conference, which takes place in March.)



Next-generation search and analytics with Apache Lucene and Solr 4

Use search engine technology to build fast, efficient, and scalable data-driven applications

Apache Lucene™ and Solr™ are highly capable open source search technologies that make it easy for organizations to enhance data access dramatically. With the 4.x line of Lucene and Solr, it’s easier than ever to add scalable search capabilities to your data-driven applications. Lucene and Solr committer Grant Ingersoll walks you through the latest Lucene and Solr features that relate to relevance, distributed search, and faceting. Learn how to leverage these capabilities to build fast, efficient, and scalable next-generation data-driven applications.

I began writing about Solr and Lucene for developerWorks six years ago (see Resources). Over those years, Lucene and Solr established themselves as rock-solid technologies (Lucene as a foundation for Java™ APIs, and Solr as a search service). For instance, they power search-based applications for Apple iTunes, Netflix, Wikipedia, and a host of others, and they help to enable the IBM Watson question-answering system.

Over the years, most people’s use of Lucene and Solr focused primarily on text-based search. Meanwhile, the new and interesting trend of big data emerged along with a (re)new(ed) focus on distributed computation and large-scale analytics. Big data often also demands real-time, large-scale information access. In light of this shift, the Lucene and Solr communities found themselves at a crossroads: the core underpinnings of Lucene began to show their age under the stressors of big data applications such as indexing all of the Twittersphere (see Resources). Furthermore, Solr’s lack of native distributed indexing support made it increasingly hard for IT organizations to scale their search infrastructure cost-effectively.

The community set to work overhauling the Lucene and Solr underpinnings (and in some cases the public APIs). Our focus shifted to enabling easy scalability, near-real-time indexing and search, and many NoSQL features — all while leveraging the core engine capabilities. This overhaul culminated in the Apache Lucene and Solr 4.x releases. These versions aim directly at solving next-generation, large-scale, data-driven access and analytics problems.

This article walks you through the 4.x highlights and shows you some code examples. First, though, you’ll go hands-on with a working application that demonstrates the concept of leveraging a search engine to go beyond search. To get the most from the article, you should be familiar with the basics of Solr and Lucene, especially Solr requests. If you’re not, see Resources for links that will get you started with Solr and Lucene.

Quick start: Search and analytics in action

Search engines are only for searching text, right? Wrong! At their heart, search engines are all about quickly and efficiently filtering and then ranking data according to some notion of similarity (a notion that’s flexibly defined in Lucene and Solr). Search engines also deal effectively with both sparse data and ambiguous data, which are hallmarks of modern data applications. Lucene and Solr are capable of crunching numbers, answering complex geospatial questions (as you’ll see shortly), and much more. These capabilities blur the line between search applications and traditional database applications (and even NoSQL applications).

For example, Lucene and Solr now:

  • Support several types of joins and grouping options
  • Have optional column-oriented storage
  • Provide several ways to deal with text and with enumerated and numerical data types
  • Enable you to define your own complex data types and storage, ranking, and analytics functions

A search engine isn’t a silver bullet for all data problems. But the fact that text search was the primary use of Lucene and Solr in the past shouldn’t prevent you from using them to solve your data needs now or in the future. I encourage you to think about using search engines in ways that go well outside the proverbial (search) box.

To demonstrate how a search engine can go beyond search, the rest of this section shows you an application that ingests aviation-related data into Solr. The application queries the data — most of which isn’t textual — and processes it with the D3 JavaScript library (see Resources) before displaying it. The data sets are from the Research and Innovative Technology Administration (RITA) of the U.S. Department of Transportation’s Bureau of Transportation Statistics and from OpenFlights. The data includes details such as originating airport, destination airport, time delays, causes of delays, and airline information for all flights in a particular time period. By using the application to query this data, you can analyze delays between particular airports, traffic growth at specific airports, and much more.

Start by getting the application up and running, and then look at some of its interfaces. Keep in mind as you go along that the application interacts with the data by interrogating Solr in various ways.


To get started, you need the following prerequisites:

  • Lucene and Solr.
  • Java 6 or higher.
  • A modern web browser. (I tested on Google Chrome and Firefox.)
  • 4GB of disk space — less if you don’t want to use all of the flight data.
  • Terminal access with a bash (or similar) shell on *nix. For Windows, you need Cygwin. I only tested on OS X with the bash shell.
  • wget if you choose to download the data by using the download script that’s in the sample code package. You can also download the flight data manually.
  • Apache Ant 1.8+ for compilation and packaging purposes, if you want to run any of the Java code examples.

See Resources for links to the Lucene, Solr, wget, and Ant download sites.

With the prerequisites in place, follow these steps to get the application up and running:

  1. Download this article’s sample code ZIP file and unzip it to a directory of your choice. I’ll refer to this directory as $SOLR_AIR.
  2. At the command line, change to the $SOLR_AIR directory:
    cd $SOLR_AIR
  3. Start Solr:
  4. Run the script that creates the necessary fields to model the data:
  5. Point your browser at http://localhost:8983/solr/#/ to display the new Solr Admin UI. Figure 1 shows an example:
    Figure 1. Solr UI

    Screen capture of the new Solr UI

  6. At the terminal, view the contents of the bin/ script for details on what to download from RITA and OpenFlights. Download the data sets either manually or by running the script:

    The download might take significant time, depending on your bandwidth.

  7. After the download is complete, index some or all of the data.

    To index all data:


    To index data from a single year, use any value between 1987 and 2008 for the year. For example:

    bin/ 1987
  8. After indexing is complete (which might take significant time, depending on your machine), point your browser at http://localhost:8983/solr/collection1/travel. You’ll see a UI similar to the one in Figure 2:
    Figure 2. The Solr Air UI

    Screen capture of an example Solr AIR screen

Exploring the data

With the Solr Air application up and running, you can look through the data and the UI to get a sense of the kinds of questions you can ask. In the browser, you should see two main interface points: the map and the search box. For the map, I started with D3’s excellent Airport example (see Resources). I modified and extended the code to load all of the airport information directly from Solr instead of from the example CSV file that comes with the D3 example. I also did some initial statistical calculations about each airport, which you can see by mousing over a particular airport.

I’ll use the search box to showcase a few key pieces of functionality that help you build sophisticated search and analytics applications. To follow along in the code, see the solr/collection1/conf/velocity/map.vm file.

The key focus areas are:

  • Pivot facets
  • Statistical functionality
  • Grouping
  • Lucene and Solr’s expanded geospatial support

Each of these areas helps you get answers such as the average delay of arriving airplanes at a specific airport, or the most common delay times for an aircraft that’s flying between two airports (per airline, or between a certain starting airport and all of the nearby airports). The application uses Solr’s statistical functionality, combined with Solr’s longstanding faceting capabilities, to draw the initial map of airport “dots” — and to generate basic information such as total flights and average, minimum, and maximum delay times. (This capability alone is a fantastic way to find bad data or at least extreme outliers.) To demonstrate these areas (and to show how easy it is to integrate Solr with D3), I’ve implemented a bit of lightweight JavaScript code that:

  1. Parses the query. (A production-quality application would likely do most of the query parsing on the server side or even as a Solr query-parser plugin.)
  2. Creates various Solr requests.
  3. Displays the results.

The request types are:

  • Lookup per three-letter airport code, such as RDU or SFO.
  • Lookup per route, such as SFO TO ATL or RDU TO ATL. (Multiple hops are not supported.)
  • Clicking the search button when the search box is empty to show various statistics for all flights.
  • Finding nearby airports by using the near operator, as in near:SFO or near:SFO TO ATL.
  • Finding likely delays at various distances of travel (less than 500 miles, 500 to 1000, 1000 to 2000, 2000 and beyond), as in likely:SFO.
  • Any arbitrary Solr query to feed to Solr’s /travel request handler, such as &q=AirportCity:Francisco.

The first three request types in the preceding list are all variations of the same type. These variants highlight Solr’s pivot faceting capabilities to show, for instance, the most common arrival delay times per route (such as SFO TO ATL) per airline per flight number. The near option leverages the new Lucene and Solr spatial capabilities to perform significantly enhanced spatial calculations such as complex polygon intersections. The likely option showcases Solr’s grouping capabilities to show airports at a range of distances from an originating airport that had arrival delays of more than 30 minutes. All of these request types augment the map with display information through a small amount of D3 JavaScript. For the last request type in the list, I simply return the associated JSON. This request type enables you to explore the data on your own. If you use this request type in your own applications, you naturally would want to use the response in an application-specific way.
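To make the dispatch concrete, the classification step can be sketched in a few lines. This toy version is mine, not the Solr Air application’s actual JavaScript (which lives in map.vm); the request-type names are assumptions chosen purely for illustration:

```java
// Hypothetical sketch of the client-side query dispatch described above.
// The type names and parse rules are assumptions, not Solr Air's actual code.
public class QueryDispatchSketch {
    public static String classify(String q) {
        q = q.trim();
        if (q.isEmpty()) return "ALL_STATS";           // empty box: stats for all flights
        if (q.startsWith("&")) return "RAW_SOLR";      // raw parameters passed to /travel
        if (q.startsWith("near:")) return "NEAR";      // nearby-airport spatial query
        if (q.startsWith("likely:")) return "LIKELY";  // delay-likelihood grouping query
        if (q.contains(" TO ")) return "ROUTE";        // e.g. "SFO TO ATL"
        return "AIRPORT";                              // single three-letter code
    }
}
```

A production parser would also validate airport codes and reject malformed input before building the Solr request.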

Now try out some queries on your own. For instance, if you search for SFO TO ATL, you should see results similar to those in Figure 3:

Figure 3. Example SFO TO ATL screen

Screen capture from Solr Air showing SFO TO ATL results

In Figure 3, the two airports are highlighted in the map that’s on the left. The Route Stats list on the right shows the most common arrival delay times per flight per airline. (I loaded the data for 1987 only.) For instance, it tells you that Delta flight 156 arrived in Atlanta five minutes late on five occasions and six minutes early on four.

You can see the underlying Solr request in your browser’s console (for example, in Chrome on the Mac, choose View -> Developer -> JavaScript Console) and in the Solr logs. The SFO-TO-ATL request that I used (broken into three lines here solely for formatting purposes) is:

&fq=Origin:SFO AND Dest:ATL
&q=*:*
&facet.pivot=UniqueCarrier,FlightNum,ArrDelay

The facet.pivot parameter provides the key functionality in this request. facet.pivot pivots from the airline (called UniqueCarrier) to FlightNum through to ArrDelay, thereby providing the nested structure that’s displayed in Figure 3’s Route Stats.
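If you want to see the shape of what facet.pivot computes, here’s a plain-Java toy (not Solr code) that builds the same nested carrier -> flight -> delay counts over an in-memory list of records:

```java
import java.util.*;

// Plain-Java sketch of what facet.pivot=UniqueCarrier,FlightNum,ArrDelay computes:
// nested counts, one map level per pivot field. Solr does this inside the engine;
// this toy version just tallies an in-memory list of records.
public class PivotSketch {
    // Each row is {carrier, flightNum, arrDelay}.
    public static Map<String, Map<String, Map<String, Integer>>> pivot(List<String[]> rows) {
        Map<String, Map<String, Map<String, Integer>>> out = new TreeMap<>();
        for (String[] r : rows) {
            out.computeIfAbsent(r[0], k -> new TreeMap<>())
               .computeIfAbsent(r[1], k -> new TreeMap<>())
               .merge(r[2], 1, Integer::sum); // count occurrences of each delay value
        }
        return out;
    }
}
```

The nested-map result mirrors the nested JSON that Solr returns for a pivot facet.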

If you try a near query, as in near:JFK, your result should look similar to Figure 4:

Figure 4. Example screen showing airports near JFK

Screen capture from Solr Air showing JFK and nearby airports

The Solr request that underlies near queries takes advantage of Solr’s new spatial functionality, which I’ll detail later in this article. For now, you can likely discern some of the power of this new functionality by looking at the request itself (shortened here for formatting purposes):

&fq=source:Airports&q=AirportLocationJTS:"IsWithin(Circle(40.639751,-73.778925 d=3))"

As you might guess, the request looks for all airports that fall within a circle whose center is at 40.639751 degrees latitude and -73.778925 degrees longitude and that has a radius of 3 degrees, or roughly 333 kilometers (one degree of latitude spans about 111 kilometers).
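If you want to sanity-check distances like that yourself, a plain-Java haversine helper (not part of Lucene or Solr) does the job; the coordinates below are JFK’s, taken from the request above:

```java
// Great-circle distance via the haversine formula, for sanity-checking
// degree-based radii like the one in the spatial query above.
public class GeoSketch {
    static final double EARTH_RADIUS_KM = 6371.0;

    public static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }
}
```

Moving one degree of latitude north from JFK yields roughly 111 km, so a 3-degree radius reaches a few hundred kilometers out.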

By now you should have a strong sense that Lucene and Solr applications can slice and dice data — numerical, textual, or other — in interesting ways. And because Lucene and Solr are both open source, with a commercial-friendly license, you are free to add your own customizations. Better yet, the 4.x line of Lucene and Solr increases the number of places where you can insert your own ideas and functionality without needing to overhaul all of the code. Keep this capability in mind as you look next at some of the highlights of Lucene 4 (version 4.4 as of this writing) and then at the Solr 4 highlights.

Lucene 4: Foundations for next-generation search and analytics

A sea change

Lucene 4 is nearly a complete rewrite of the underpinnings of Lucene for better performance and flexibility. At the same time, this release represents a sea change in the way the community develops software, thanks to Lucene’s new randomized unit-testing framework and rigorous community standards that relate to performance. For instance, the randomized test framework (which is available as a packaged artifact for anyone to use) makes it easy for the project to test the interactions of variables such as JVMs, locales, input content and queries, storage formats, scoring formulas, and many more. (Even if you never use Lucene, you might find the test framework useful in your own projects.)

Some of the key additions and changes to Lucene are in the categories of speed and memory, flexibility, data structures, and faceting. (To see all of the details on the changes in Lucene, read the CHANGES.txt file that’s included within every Lucene distribution.)

Speed and memory

Although prior Lucene versions are generally considered to be fast enough — especially relative to comparable general-purpose search libraries — enhancements in Lucene 4 make many operations significantly faster than in previous versions.

The graph in Figure 5 captures the performance of Lucene indexing as measured in gigabytes per hour. (Credit Lucene committer Mike McCandless for the nightly Lucene benchmarking graphs; see Resources.) Figure 5 shows that a huge performance improvement occurred in the first half of May [[year?]]:

Figure 5. Lucene indexing performance

Graph of Lucene indexing performance that shows an increase from 100GB per hour to approximately 270GB per hour in the first half of May [[year?]]

The improvement that Figure 5 shows comes from a series of changes that were made to how Lucene builds its index structures and how it handles concurrency when building them (along with a few other changes, including JVM changes and use of solid-state drives). The changes focused on removing synchronizations while Lucene writes the index to disk; for details (which are beyond this article’s scope) see Resources for links to Mike McCandless’s blog posts.

In addition to improving overall indexing performance, Lucene 4 can perform near real time (NRT) indexing operations. NRT operations can significantly reduce the time that it takes for the search engine to reflect changes to the index. To use NRT operations, you must do some coordination in your application between Lucene’s IndexWriter and IndexReader. Listing 1 (a snippet from the download package’s src/main/java/ file) illustrates this interplay:

Listing 1. Example of NRT search in Lucene
doc = new HashSet<IndexableField>();
//index() adds an initial batch of 100 documents and commits them
index(writer, doc);
//Get a searcher
IndexSearcher searcher = new IndexSearcher(;
printResults(searcher);
//Now, index one more doc
doc.add(new StringField("id", "id_" + 100, Field.Store.YES));
doc.add(new TextField("body", "This is document 100.", Field.Store.YES));
writer.addDocument(doc);
//The results are still 100
printResults(searcher);
//Don't commit; just open a new searcher directly from the writer
searcher = new IndexSearcher(, false));
//The results now reflect the new document that was added
printResults(searcher);
In Listing 1, I first index and commit a set of documents to the Directory and then search the Directory — the traditional approach in Lucene. NRT comes in when I proceed to index one more document: Instead of doing a full commit, Lucene creates a new IndexSearcher from the IndexWriter and then does the search. You can run this example by changing to the $SOLR_AIR directory and executing this sequence of commands:

  1. ant compile
  2. cd build/classes
  3. java -cp ../../lib/*:. IndexingExamples

Note: I grouped several of this article’s code examples into, so you can use the same command sequence to run the later examples in Listing 2 and Listing 4.

The output that prints to the screen is:

Num docs: 100
Num docs: 100
Num docs: 101

Lucene 4 also contains memory improvements that leverage some more-advanced data structures (which I cover in more detail in Finite state automata and other goodies). These improvements not only reduce Lucene’s memory footprint but also significantly speed up queries that are based on wildcards and regular expressions. Additionally, the code base moved away from working with Java String objects in favor of managing large allocations of byte arrays. (The BytesRef class is seemingly ubiquitous under the covers in Lucene now.) As a result, String overhead is reduced and the number of objects on the Java heap is under better control, which reduces the likelihood of stop-the-world garbage collections.
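The BytesRef idea is easy to demonstrate without Lucene: many terms can share one backing byte array, with each term being just an offset/length view onto it. The sketch below mirrors the concept only; Lucene’s actual BytesRef class has a richer API:

```java
import java.nio.charset.StandardCharsets;

// Toy illustration of the BytesRef concept: terms share one backing byte
// array, and each term is an (offset, length) view onto it. That means far
// fewer heap objects than allocating one String per term, which is one way
// Lucene 4 keeps garbage collection under control.
public class ByteSliceSketch {
    final byte[] bytes;
    final int offset;
    final int length;

    ByteSliceSketch(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    // Decode the slice only when a human-readable form is actually needed.
    public String utf8ToString() {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }
}
```

Two terms here cost one array plus two small view objects, instead of two independent String instances with their own character storage.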

Some of the flexibility enhancements also yield performance and storage improvements because you can choose better data structures for the types of data that your application is using. For instance, as you’ll see next, you can choose to index/store unique keys (which are dense and don’t compress well) one way in Lucene and index/store text in a completely different way that better suits text’s sparseness.


What’s a segment?

A Lucene segment is a subset of the overall index. In many ways a segment is a self-contained mini-index. Lucene builds its index by using segments to balance the availability of the index for searching with the speed of writing. Segments are write-once files during indexing, and a new one is created every time you commit during writing. In the background, by default, Lucene periodically merges smaller segments into larger segments to improve read performance and reduce system overhead. You can exercise complete control over this process.
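The merge step is conceptually simple, even though Lucene’s implementation is not. As a sketch under heavy simplification — treating each segment as nothing more than a sorted list of terms — a merge looks like this:

```java
import java.util.*;

// Minimal sketch of a segment merge: each "segment" is a write-once sorted
// term list, and merging produces one larger sorted segment. Lucene's real
// merges also combine postings, deletions, norms, and stored fields; this
// shows only the overall shape of the operation.
public class SegmentMergeSketch {
    public static List<String> merge(List<String> seg1, List<String> seg2) {
        List<String> merged = new ArrayList<>();
        int i = 0, j = 0;
        // Standard two-way merge of sorted inputs
        while (i < seg1.size() && j < seg2.size()) {
            if (seg1.get(i).compareTo(seg2.get(j)) <= 0) merged.add(seg1.get(i++));
            else merged.add(seg2.get(j++));
        }
        while (i < seg1.size()) merged.add(seg1.get(i++));
        while (j < seg2.size()) merged.add(seg2.get(j++));
        return merged;
    }
}
```

Because the inputs are write-once and sorted, the merge is a sequential read of both segments and a sequential write of the result, which is friendly to disks and to concurrent searches on the old segments.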

Flexibility

The flexibility improvements in Lucene 4.x unlock a treasure-trove of opportunity for developers (and researchers) who want to squeeze every last bit of quality and performance out of Lucene. To enhance flexibility, Lucene offers two new well-defined plugin points. Both plugin points have already had a significant impact on the way Lucene is both developed and used.

The first new plugin point is designed to give you deep control over the encoding and decoding of a Lucene segment. The Codec class defines this capability. Codec gives you control over the format of the postings list (that is, the inverted index), Lucene storage, boost factors (also called norms), and much more.

In some applications you might want to implement your own Codec. But it’s much more likely that you’ll want to change the Codec that’s used for a subset of the document fields in the index. To understand this point, it helps to think about the kinds of data you are putting in your application. For instance, identifying fields (for example, your primary key) are usually unique. Because primary keys only ever occur in one document, you might want to encode them differently from how you encode the body of an article’s text. You don’t actually change the Codec in these cases. Instead, you change one of the lower-level classes that the Codec delegates to.

To demonstrate, I’ll show you a code example that uses my favorite Codec: the SimpleTextCodec. The SimpleTextCodec is what it sounds like: a Codec for encoding the index in simple text. (The fact that SimpleTextCodec was written and passes Lucene’s extensive test framework is a testament to Lucene’s enhanced flexibility.) SimpleTextCodec is too large and slow to use in production, but it’s a great way to see what a Lucene index looks like under the covers. The code in Listing 2 changes a Codec instance to SimpleTextCodec:

Listing 2. Example of changing Codec instances in Lucene
conf.setCodec(new SimpleTextCodec());
File simpleText = new File("simpletext");
directory = new SimpleFSDirectory(simpleText);
//Let's write to disk so that we can see what it looks like
writer = new IndexWriter(directory, conf);
index(writer, doc);//index the same docs as before

By running the Listing 2 code, you create a local build/classes/simpletext directory. To see the Codec in action, change to build/classes/simpletext and open the .cfs file in a text editor. You can see that the .cfs file truly is plain old text, like the snippet in Listing 3:

Listing 3. Portion of _0.cfs plain-text index file
  term id_97
    doc 97
  term id_98
    doc 98
  term id_99
    doc 99
doc 0
  numfields 4
  field 0
    name id
    type string
    value id_100
  field 1
    name body
    type string
    value This is document 100.

For the most part, changing the Codec isn’t useful until you are working with extremely large indexes and query volumes, or if you are a researcher or search-engine maven who loves to play with bare metal. Before changing Codecs in those cases, do extensive testing of the various available Codecs by using your actual data. Solr users can set and change these capabilities by modifying simple configuration items. Refer to the Solr Reference Guide for more details (see Resources).

The second significant new plugin point makes Lucene’s scoring model completely pluggable. You are no longer limited to using Lucene’s default scoring model, which some detractors claim is too simple. If you prefer, you can use alternative scoring models such as BM25 and Divergence from Randomness (see Resources), or you can write your own. Why write your own? Perhaps your “documents” represent molecules or genes; you want a fast way of ranking them, but term frequency and document frequency aren’t applicable. Or perhaps you want to try out a new scoring model that you read about in a research paper to see how it works on your content. Whatever your reason, changing the scoring model requires you to change the model at indexing time through the IndexWriterConfig.setSimilarity(Similarity) method, and at search time through the IndexSearcher.setSimilarity(Similarity) method. Listing 4 demonstrates changing the Similarity by first running a query that uses the default Similarity and then re-indexing and rerunning the query using Lucene’s BM25Similarity:

Listing 4. Changing Similarity in Lucene
conf = new IndexWriterConfig(Version.LUCENE_44, analyzer);
directory = new RAMDirectory();
writer = new IndexWriter(directory, conf);
index(writer, DOC_BODIES);
searcher = new IndexSearcher(;
System.out.println("Lucene default scoring:");
TermQuery query = new TermQuery(new Term("body", "snow"));
printResults(searcher, query, 10);

BM25Similarity bm25Similarity = new BM25Similarity();
Directory bm25Directory = new RAMDirectory();
writer = new IndexWriter(bm25Directory, conf);
index(writer, DOC_BODIES);
searcher = new IndexSearcher(;
System.out.println("Lucene BM25 scoring:");
printResults(searcher, query, 10);

Run the code in Listing 4 and examine the output. Notice that the scores are indeed different. Whether the results of the BM25 approach more accurately reflect a user’s desired set of results is ultimately up to you and your users to decide. I recommend that you set up your application in a way that makes it easy for you to run experiments. (A/B testing should help.) Then compare not only the Similarity results, but also the results of varying query construction, Analyzer, and many other items.
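To build intuition for why BM25 scores differ from Lucene’s classic model, it helps to compute the BM25 term weight by hand. This standalone sketch uses the standard BM25 formula with the common defaults k1=1.2 and b=0.75 (the same defaults BM25Similarity uses); it is for intuition only and is not Lucene’s implementation:

```java
// Standalone computation of the BM25 term weight:
//   idf * (tf * (k1 + 1)) / (tf + k1 * ((1 - b) + b * docLen / avgDocLen))
// k1 controls how quickly term-frequency contribution saturates;
// b controls how strongly long documents are penalized.
public class Bm25Sketch {
    public static double score(double tf, double docLen, double avgDocLen,
                               double docFreq, double numDocs,
                               double k1, double b) {
        double idf = Math.log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
        double lengthNorm = k1 * ((1 - b) + b * docLen / avgDocLen);
        return idf * (tf * (k1 + 1)) / (tf + lengthNorm);
    }
}
```

Notice how the term-frequency contribution saturates: a term that appears ten times scores well under ten times what a single occurrence scores, unlike the classic model’s sqrt(tf) growth.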

Finite state automata and other goodies

A complete overhaul of Lucene’s data structures and algorithms spawned two especially interesting advancements in Lucene 4:

  • DocValues (also known as column stride fields).
  • Finite State Automata (FSA) and Finite State Transducers (FST). I’ll refer to both as FSAs for the remainder of this article. (Technically, an FST outputs values as its arcs are traversed, but that distinction isn’t important for the purposes of this article.)

Both DocValues and FSA provide significant new performance benefits for certain types of operations that can affect your application.

On the DocValues side, in many cases applications need to access all of the values of a single field very quickly, in sequence. Or applications need to do quick lookups of values for sorting or faceting, without incurring the cost of building an in-memory version from an index (a process that’s known as un-inverting). DocValues are designed to answer these kinds of needs.
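The contrast between the two access patterns is easy to see in miniature. In this sketch (mine, not Lucene’s), the inverted form answers “which documents contain X?” while the column form answers “what is document N’s value?” in a single array lookup, which is the access pattern DocValues serve without un-inverting:

```java
import java.util.*;

// Sketch of the difference between an inverted index (value -> doc ids,
// great for search) and a column of per-document values (doc id -> value,
// great for sorting and faceting). DocValues persist the column form at
// index time, so there is no need to rebuild ("un-invert") it in memory.
public class ColumnSketch {
    // Build the inverted form: value -> sorted list of doc ids.
    public static Map<String, List<Integer>> invert(String[] valueByDoc) {
        Map<String, List<Integer>> inverted = new TreeMap<>();
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            inverted.computeIfAbsent(valueByDoc[doc], k -> new ArrayList<>()).add(doc);
        }
        return inverted;
    }

    // Column access: doc id -> value is a single O(1) array lookup.
    public static String valueOf(String[] valueByDoc, int doc) {
        return valueByDoc[doc];
    }
}
```

Sorting or faceting needs the doc-to-value direction for every hit, which is why storing the column form directly pays off.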

An application that does a lot of wildcard or fuzzy queries should see a significant performance improvement due to the use of FSAs. Lucene and Solr now support query auto-suggest and spell-checking capabilities that leverage FSAs. And Lucene’s default Codec significantly reduces disk and memory footprint by using FSAs under the hood to store the term dictionary (the structure that Lucene uses to look up query terms during a search). FSAs have many uses in language processing, so you might also find Lucene’s FSA capabilities to be instructive for other applications.

Figure 6 shows an FSA that’s built from the words mop, pop, moth, star, stop, and top, along with associated weights. From the example, you can imagine starting with input such as moth, breaking it down into its characters (m-o-t-h), and then following the arcs in the FSA.

Figure 6. Example of an FSA

Illustration of an example FSA
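Before looking at Lucene’s FSA API, the core matching behavior is easy to reproduce with the standard library alone. This toy deterministic automaton (a trie with accept states, lacking the weights and suffix sharing of a true minimal FSA) shows how words such as star and stop share arcs:

```java
import java.util.*;

// A tiny deterministic automaton (a trie with accept states) over the words
// in Figure 6, built with the standard library only. It demonstrates the key
// property FSAs exploit: "star" and "stop" share the "st" arcs, so the
// structure is smaller than storing every word separately. A true minimal
// FSA would also share suffixes; this sketch does not.
public class FsaSketch {
    static class Node {
        Map<Character, Node> arcs = new HashMap<>();
        boolean accept; // true if a word ends at this node
    }

    final Node root = new Node();

    public void add(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.arcs.computeIfAbsent(c, k -> new Node());
        }
        n.accept = true;
    }

    // Follow the arcs character by character; match only if we end on an accept state.
    public boolean run(String input) {
        Node n = root;
        for (char c : input.toCharArray()) {
            n = n.arcs.get(c);
            if (n == null) return false;
        }
        return n.accept;
    }
}
```

Running moth through it follows the m-o-t-h arcs exactly as described for Figure 6; the prefix mo alone does not match because no word ends there.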

Listing 5 (excerpted from the file in this article’s sample code download) shows a simple example of building your own FSA by using Lucene’s API:

Listing 5. Example of a simple Lucene automaton
String[] words = {"hockey", "hawk", "puck", "text", "textual", "anachronism", "anarchy"};
Collection<BytesRef> strings = new ArrayList<BytesRef>();
for (String word : words) {
  strings.add(new BytesRef(word));
}

//build up a simple automaton out of several words
Automaton automaton = BasicAutomata.makeStringUnion(strings);
CharacterRunAutomaton run = new CharacterRunAutomaton(automaton);
System.out.println("Match: " +"hockey"));
System.out.println("Match: " +"ha"));

In Listing 5, I build an Automaton out of various words and feed it into a RunAutomaton. As the name implies, a RunAutomaton runs input through the automaton, in this case to match the input strings that are captured in the print statements at the end of Listing 5. Although this example is trivial, it lays the groundwork for understanding much more advanced capabilities that I’ll leave to readers to explore (along with DocValues) in the Lucene APIs. (See Resources for relevant links.)


Faceting

At its core, faceting generates a count of document attributes to give users an easy way to narrow down their search results without making them guess which keywords to add to the query. For example, if someone searches a shopping site for televisions, facets tell them how many TV models are made by each manufacturer. Increasingly, faceting is also used to power search-based business analytics and reporting tools. By using more-advanced faceting capabilities, you give users the ability to slice and dice facets in interesting ways.
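Stripped to its essence, a facet count is just a tally over the documents that matched a query. This plain-Java sketch (illustrative only, not how Solr or Lucene implement faceting) counts a manufacturer facet for the television example above:

```java
import java.util.*;

// The essence of a facet count: for the documents that matched a query,
// tally one chosen attribute. Here the "documents" are TVs that matched a
// search and the facet field is the manufacturer. Real engines do this over
// index structures rather than materialized documents.
public class FacetCountSketch {
    public static Map<String, Integer> countFacet(List<Map<String, String>> matchingDocs,
                                                  String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> doc : matchingDocs) {
            String value = doc.get(field);
            if (value != null) counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }
}
```

Each count doubles as a filter the user can click, which is what makes facets an effective navigation and analytics device.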

Facets have long been a hallmark of Solr (since version 1.1). Now Lucene has its own faceting module that stand-alone Lucene applications can leverage. Lucene’s faceting module isn’t as rich in functionality as Solr’s, but it does offer some interesting tradeoffs. Lucene’s faceting module isn’t dynamic, in that you must make some faceting decisions at indexing time. But it is hierarchical, and it doesn’t incur the cost of dynamically un-inverting fields into memory.

Listing 6 (part of the sample code’s file) showcases some of Lucene’s new faceting capabilities:

Listing 6. Lucene faceting examples
DirectoryTaxonomyWriter taxoWriter = 
     new DirectoryTaxonomyWriter(facetDir, IndexWriterConfig.OpenMode.CREATE);
FacetFields facetFields = new FacetFields(taxoWriter);
for (int i = 0; i < DOC_BODIES.length; i++) {
  String docBody = DOC_BODIES[i];
  String category = CATEGORIES[i];
  Document doc = new Document();
  CategoryPath path = new CategoryPath(category, '/');
  //Setup the fields
  facetFields.addFields(doc, Collections.singleton(path));//just do a single category path
  doc.add(new StringField("id", "id_" + i, Field.Store.YES));
  doc.add(new TextField("body", docBody, Field.Store.YES));
  indexWriter.addDocument(doc);//indexWriter is created earlier in the sample code
}
DirectoryReader reader =, true);
IndexSearcher searcher = new IndexSearcher(reader);
DirectoryTaxonomyReader taxor = new DirectoryTaxonomyReader(taxoWriter);
ArrayList<FacetRequest> facetRequests = new ArrayList<FacetRequest>();
CountFacetRequest home = new CountFacetRequest(new CategoryPath("Home", '/'), 100);
facetRequests.add(home);
facetRequests.add(new CountFacetRequest(new CategoryPath("Home/Sports", '/'), 10));
facetRequests.add(new CountFacetRequest(new CategoryPath("Home/Weather", '/'), 10));
FacetSearchParams fsp = new FacetSearchParams(facetRequests);

FacetsCollector facetsCollector = FacetsCollector.create(fsp, reader, taxor); MatchAllDocsQuery(), facetsCollector);

for (FacetResult fres : facetsCollector.getFacetResults()) {
  FacetResultNode root = fres.getFacetResultNode();
  printFacet(root, 0);
}

The key pieces in Listing 6 that go beyond normal Lucene indexing and search are the FacetFields, FacetsCollector, TaxonomyReader, and TaxonomyWriter classes. FacetFields creates the appropriate field entries in the document and works in concert with TaxonomyWriter at indexing time. At search time, TaxonomyReader works with FacetsCollector to get the appropriate counts for each category. Note, also, that Lucene's faceting module creates a secondary index that, to be effective, must be kept in sync with the main index. Run the Listing 6 code by using the same command sequence you used for the earlier examples, except substitute FacetExamples for IndexingExamples in the java command. You should get:

Home (0.0)
 Home/Children (3.0)
  Home/Children/Nursery Rhymes (3.0)
 Home/Weather (2.0)
 Home/Sports (2.0)
  Home/Sports/Rock Climbing (1.0)
  Home/Sports/Hockey (1.0)
 Home/Writing (1.0)
 Home/Quotes (1.0)
  Home/Quotes/Yoda (1.0)
 Home/Music (1.0)
  Home/Music/Lyrics (1.0)

Notice that in this particular implementation I'm not including the counts for the Home facet, because including them can be expensive. That option is supported by setting up the appropriate FacetIndexingParams, which I don't cover here. Lucene's faceting module has additional capabilities that I also don't cover. I encourage you to explore them — and other new Lucene features that this article doesn't touch on — by checking out the Resources section. And now, on to Solr 4.x.

Solr 4: Search and analytics at scale

From an API perspective, much of Solr 4.x looks and feels the same as previous versions. But 4.x contains numerous enhancements that make it easier to use, and more scalable, than ever. Solr also enables you to answer new types of questions, all while leveraging many of the Lucene enhancements that I just outlined. Other changes are geared toward the developer’s getting-started experience. For example, the all-new Solr Reference Guide (see Resources) provides book-quality documentation of every Solr release (starting with 4.4). And Solr’s new schemaless capabilities make it easy to add new data to the index quickly without first needing to define a schema. You’ll learn about Solr’s schemaless feature in a moment. First you’ll look at some of the new search, faceting, and relevance enhancements in Solr, some of which you saw in action in the Solr Air application.

Search, faceting, and relevance

Several new Solr 4 capabilities are designed to make it easier — on both the indexing side and the search-and-faceting side — to build next-generation data-driven applications. Table 1 summarizes the highlights and includes command and code examples when applicable:

Table 1. Indexing, searching, and faceting highlights in Solr 4
Name Description Example
Pivot faceting Gather counts for all of a facet's subfacets, as filtered through the parent facet. See the Solr Air example for more details. Pivot on a variety of fields:
http://localhost:8983/solr/collection1/travel?&wt=json&facet=true&facet.limit=5&fq=&q=*:*  &facet.pivot=Origin,Dest,UniqueCarrier,FlightNum,ArrDelay&indent=true
New relevance function queries Access various index-level statistics, such as document frequency and term frequency, as part of a function query. Add the document frequency for the term Origin:SFO to all returned documents:
http://localhost:8983/solr/collection1/travel?&wt=json&q=*:*&fl=*, {!func}docfreq('Origin',%20'SFO')&indent=true
Note that this command also uses the new DocTransformers capability.
Joins Represent more-complex document relationships and then join them at search time. More-complex joins are slated for future releases of Solr. Return only flights that have originating airport codes that appear in the Airport data set (and compare to the results of a request without the join):
Codec support Change the Codec for the index and the postings format for individual fields. Use the SimpleTextCodec for a field:
<fieldType name="string_simpletext" postingsFormat="SimpleText" />
New update processors Use Solr’s Update Processor framework to plug in code to change documents before they are indexed but after they are sent to Solr.
  • Field mutating (for example, concatenate fields, parse numerics, trim)
  • Scripting. Use JavaScript or other code that’s supported by the JavaScript engine to process documents. See the update-script.js file in the Solr Air example.
  • Language detection (technically available in 3.5, but worth mentioning here) for identifying the language (such as English or Japanese) that’s used in a document.
Atomic updates Send in just the parts of a document that have changed, and let Solr take care of the rest. From the command line, using cURL, change the origin of document 243551 to be FOO:
curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d ' [{"id": "243551","Origin": {"set":"FOO"}}]'

You can run the first three example commands in Table 1 in your browser’s address field (not the Solr Air UI) against the Solr Air demo data.

For more details on relevance functions, joins, and Codec — and other new Solr 4 features — see Resources for relevant links to the Solr Wiki and elsewhere.

Scaling, NoSQL, and NRT

Probably the single most significant change to Solr in recent years is how much simpler it has become to build a multinode, scalable search solution. With Solr 4.x, it's easier than ever to scale Solr to be the authoritative storage and access mechanism for billions of records — all while enjoying the search and faceting capabilities that Solr has always been known for. Furthermore, you can rebalance your cluster as your capacity needs change, as well as take advantage of optimistic locking, atomic updates of content, and real-time retrieval of data even if it hasn't been indexed yet. The new distributed capabilities in Solr are referred to collectively as SolrCloud.

How does SolrCloud work? Documents that are sent to Solr 4 when it’s running in (optional) distributed mode are routed according to a hashing mechanism to a node in the cluster (called the leader). The leader is responsible for indexing the document into a shard. A shard is a single index that is served by a leader and zero or more replicas. As an illustration, assume that you have four machines and two shards. When Solr starts, each of the four machines communicates with the other three. Two of the machines are elected leaders, one for each shard. The other two nodes automatically become replicas of one of the shards. If one of the leaders fails for some reason, a replica (in this case the only replica) becomes the leader, thereby guaranteeing that the system still functions properly. You can infer from this example that in a production system enough nodes must participate to ensure that you can handle system outages.
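The routing step can be illustrated with a toy sketch. Note this is not Solr's actual implementation — SolrCloud hashes the document ID (with MurmurHash) against the cluster's shard ranges — but any stable hash shows the principle: the same ID always routes to the same shard.

```shell
# Toy illustration of hash-based document routing (not Solr's actual code).
# cksum gives a stable checksum of the ID; mod picks one of the shards.
num_shards=2
route() {
  h=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo $(( h % num_shards ))
}
route 243551
```

Running `route` twice with the same ID always prints the same shard number, which is exactly the property the cluster relies on to find a document again at query time.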

To see SolrCloud in action, you can launch a two-node, two-shard system by running the script that you used in the Solr Air example with a -z flag. From the *NIX command line, first shut down your old instance:

kill -9 PROCESS_ID

Then restart the system:

bin/ -c -z

The -c flag erases the old index. The -z flag tells Solr to start up with an embedded version of Apache Zookeeper.

Apache Zookeeper

Zookeeper is a distributed coordination system that’s designed to elect leaders, establish a quorum, and perform other tasks to coordinate the nodes in a cluster. Thanks to Zookeeper, a Solr cluster never suffers from “split-brain” syndrome, whereby part of the cluster behaves independently of the rest of the cluster as the result of a partitioning event. See Resources to learn more about Zookeeper.

Point your browser at the SolrCloud admin page, http://localhost:8983/solr/#/~cloud, to verify that two nodes are participating in the cluster. You can now re-index your content, and it will be spread across both nodes. All queries to the system are also automatically distributed. You should get the same number of hits for a match-all-documents search against two nodes that you got for one node.

The script launches Solr with the following command for the first node:

java -Dbootstrap_confdir=$SOLR_HOME/solr/collection1/conf 
-Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

The script tells the second node where Zookeeper is:

java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

Embedded Zookeeper is great for getting started, but to ensure high availability and fault tolerance for production systems, set up a stand-alone set of Zookeeper instances in your cluster.

Stacked on top of the SolrCloud capabilities is support for NRT and many NoSQL-like functions, such as:

  • Optimistic locking
  • Atomic updates
  • Real-time gets (retrieving a specific document before it is committed)
  • Transaction-log-backed durability

Many of the distributed and NoSQL functions in Solr — such as automatic versioning of documents and transaction logs — work out of the box. For a few other features, the descriptions and examples in Table 2 will be helpful:

Table 2. Summary of distributed and NoSQL features in Solr 4
Name Description Example
Realtime get Retrieve a document, by ID, regardless of its state of indexing or distribution. Get the document whose ID is 243551:
Shard splitting Split your index into smaller shards so they can be migrated to new nodes in the cluster. Split shard1 into two shards:
NRT Use NRT to search for new content much more quickly than in previous versions. Turn on <autoSoftCommit> in your solrconfig.xml file. For example:
Document routing Specify which documents live on which nodes. Ensure that all of a user’s data is on certain machines. Read Joel Bernstein’s blog post (see Resources).
Collections Create, delete, or update collections as needed, programmatically, using Solr’s new collections API. Create a new collection named hockey:
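The NRT row in Table 2 stops short of showing the actual configuration. A typical soft-commit setting in solrconfig.xml looks something like the following sketch — the 1000-millisecond interval is an illustrative value, not a recommendation; tune it to your latency and throughput needs:

```xml
<!-- Inside <updateHandler> in solrconfig.xml: make newly added documents
     visible to searchers roughly every second, without a hard commit -->
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>
```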

Going schemaless

Schemaless: Marketing hype?

Data collections rarely lack a schema. Schemaless is a marketing term that’s derived from a data-ingestion engine’s ability to react appropriately to the data “telling” the engine what the schema is — instead of the engine specifying the form that the data must take. For instance, Solr can accept JSON input and can index content appropriately on the basis of the schema that’s implicitly defined in the JSON. As someone pointed out to me on Twitter, less schema is a better term than schemaless, because you define the schema in one place (such as a JSON document) instead of two (such as a JSON document and Solr).

Based on my experience, in the vast majority of cases you should not use schemaless in a production system unless you enjoy debugging errors at 2 a.m. when your system thinks it has one type of data and in reality has another.

Solr’s schemaless functionality enables clients to add content rapidly without the overhead of first defining a schema.xml file. Solr examines the incoming data and passes it through a cascading set of value parsers. The value parsers guess the data’s type and then automatically add the fields to the internal schema and add the content to the index.

A typical production system (with some exceptions) shouldn’t use schemaless, because the value guessing isn’t always perfect. For instance, the first time Solr sees a new field, it might identify the field as an integer and thus define an integer FieldType in the underlying schema. But you may discover three weeks later that the field is useless for searching because the rest of the content that Solr sees for that field consists of float point values.
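To make the guessing concrete, here's a toy sketch of a cascading value parser. This is not Solr's implementation — just the general idea: each parser in the cascade gets a chance at the raw value, and the first one that matches wins.

```shell
# Toy cascading value parser (not Solr's actual code): try integer,
# then float, then boolean, and fall back to string.
guess_type() {
  if printf '%s' "$1" | grep -Eq '^-?[0-9]+$'; then echo int
  elif printf '%s' "$1" | grep -Eq '^-?[0-9]*\.[0-9]+$'; then echo float
  elif printf '%s' "$1" | grep -Eiq '^(true|false)$'; then echo boolean
  else echo string
  fi
}
guess_type 42     # → int
guess_type 42.5   # → float: the 2 a.m. surprise if the field was already typed as int
```

The last two calls show exactly the failure mode described above: the first value seen locks in an integer FieldType, and later float values for the same field no longer fit.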

However, schemaless is especially helpful for early-stage development or for indexing content whose format you have little to no control over. For instance, Table 2 includes an example of using the collections API in Solr to create a new collection:


After you create the collection, you can use schemaless to add content to it. First, though, take a look at the current schema. As part of implementing schemaless support, Solr also added Representational State Transfer (REST) APIs for accessing the schema. You can see all of the fields defined for the hockey collection by pointing your browser (or cURL on the command line) at http://localhost:8983/solr/hockey/schema/fields. You see all of the fields from the Solr Air example. The schema uses those fields because the create option used my default configuration as the basis for the new collection. You can override that configuration if you want. (A side note: The script that’s included in the sample code download uses the new schema APIs to create all of the field definitions automatically.)

To add to the collection by using schemaless, run:


The following JSON is added to the hockey collection that you created earlier:

        "id": "id1",
        "team": "Carolina Hurricanes",
        "description": "The NHL franchise located in Raleigh, NC",
        "cupWins": 1

As you know from examining the schema before you added this JSON to the collection, the team, description, and cupWins fields are new. When the script ran, Solr guessed their types automatically and created the fields in the schema. To verify, refresh the results at http://localhost:8983/solr/hockey/schema/fields. You should now see team, description, and cupWins all defined in the list of fields.

Spatial (not just geospatial) improvements

Solr’s longstanding support for point-based spatial searching enables you to find all documents that are within some distance of a point. Although Solr supports this approach in an n-dimensional space, most people use it for geospatial search (for example, find all restaurants near my location). But until now, Solr didn’t support more-involved spatial capabilities such as indexing polygons or performing searches within indexed polygons. Some of the highlights of the new spatial package are:

  • Support through the Spatial4J library (see Resources) for many new spatial types — such as rectangles, circles, lines, and arbitrary polygons — and support for the Well Known Text (WKT) format
  • Multivalued indexed fields, which you can use to encode multiple points into the same field
  • Configurable precision that gives the developer more control over accuracy versus computation speed
  • Fast filtering of content
  • Query support for IsWithin, Contains, and IsDisjointTo
  • Optional support for the Java Topological Suite (JTS) (see Resources)
  • Lucene APIs and artifacts

The schema for the Solr Air application has several field types that are set up to take advantage of this new spatial functionality. I defined two field types for working with the latitude and longitude of the airport data:

<fieldType name="location_jts" 
distErrPct="0.025" spatialContextFactory=
maxDistErr="0.000009" units="degrees"/>

<fieldType name="location_rpt" 
distErrPct="0.025" geo="true" maxDistErr="0.000009" units="degrees"/>

The location_jts field type explicitly uses the optional JTS integration to define a point, and the location_rpt field type doesn't. If you want to index anything more complex than simple rectangles, you need to use the JTS version. The fields' attributes help to define the system's accuracy. These attributes are required at indexing time because Solr, via Lucene and Spatial4j, encodes the data in multiple ways to ensure that the data can be used efficiently at search time. For your applications, you'll likely want to run some tests with your data to determine the tradeoffs to make in terms of index size, precision, and query-time performance.

In addition, the near query that's used in the Solr Air application uses the new spatial-query syntax (IsWithin on a Circle) for finding airports near the specified origin and destination airports.
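For reference, such a filter looks roughly like the following. The field name matches the schema fragment above, but the coordinates and distance are placeholders I've made up for illustration — check the Spatial4j/Solr documentation for the exact syntax supported by your version:

```
fq=location_rpt:"IsWithin(Circle(40.77,-73.87 d=0.5))"
```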

New administration UI

In wrapping up this section on Solr, I would be remiss if I didn’t showcase the much more user-friendly and modern Solr admin UI. The new UI not only cleans up the look and feel but also adds new functionality for SolrCloud, document additions, and much more.

For starters, when you first point your browser at http://localhost:8983/solr/#/, you should see a dashboard that succinctly captures much of the current state of Solr: memory usage, working directories, and more, as in Figure 7:

Figure 7. Example Solr dashboard

Screen capture of an example Solr dashboard

If you select Cloud in the left side of the dashboard, the UI displays details about SolrCloud. For example, you get in-depth information about the state of configuration, live nodes, and leaders, as well as visualizations of the cluster topology. Figure 8 shows an example. Take a moment to work your way through all of the cloud UI options. (You must be running in SolrCloud mode to see them.)

Figure 8. Example SolrCloud UI

Screen capture of a SolrCloud UI example

The last area of the UI to cover that's not tied to a specific core/collection/index is the Core Admin set of screens. These screens provide point-and-click control over the administration of cores, including adding, deleting, reloading, and swapping cores. Figure 9 shows the Core Admin UI:

Figure 9. Example of Core Admin UI

Screen capture of the core Solr admin UI

By selecting a core from the Core list, you access an overview of information and statistics that are specific to that core. Figure 10 shows an example:

Figure 10. Example core overview

Screen capture of a core overview example in the Solr UI

Most of the per-core functionality is similar to the pre-4.x UI's functionality (albeit in a much more pleasant way), with the exception of the Documents option. You can use the Documents option to add documents in various formats (JSON, CSV, XML, and others) to the collection directly from the UI, as Figure 11 shows:

Figure 11. Example of adding a document from the UI

Screen capture from the Solr UI that shows a JSON document being added to a collection

You can even upload rich document types such as PDF and Word. Take a moment to add some documents to your index or browse the other per-collection capabilities, such as the Query interface or the revamped Analysis screen.

The road ahead

Next-generation search-engine technology gives users the power to decide what to do with their data. This article gave you a good taste of what Lucene and Solr 4 are capable of, and, I hope, a broader sense of how search engines solve non-text-based search problems that involve analytics and recommendations.

Lucene and Solr are in constant motion, thanks to a large sustaining community that's backed by more than 30 committers and hundreds of contributors. The community is actively developing two main branches: the current officially released 4.x branch and the trunk branch, which represents the next major (5.x) release. On the official release branch, the community is committed to backward compatibility and an incremental approach to development that focuses on easy upgrades of current applications. On the trunk branch, the community is a bit less restricted in terms of ensuring compatibility with previous releases. If you want to try out the cutting edge in Lucene or Solr, check out the trunk branch of the code from Subversion or Git (see Resources). Whichever path you choose, you can take advantage of Lucene and Solr for powerful search-based analytics that go well beyond plain text search.


Thanks to David Smiley, Erik Hatcher, Yonik Seeley, and Mike McCandless for their help.



How to Set Up The Ampache Streaming Music Server In Ubuntu


To follow this tutorial you will need:

  1. A PC with Ubuntu 12.04 LTS running LAMP
  2. Your own web address (optional – required for streaming music to external clients like your work computer or cell phone)
  3. Forward port 80 from your router to your ubuntu server (optional – required for #2 above)
  4. SAMBA/SWAT running on your server.

Each of these can be done for free by following the four tutorials linked below.


NB: The very last step of (4) is to log in as a “member of the admin group”. You can ignore this and log in to SWAT later, when this tutorial instructs you to actually use it.

O.K. I will assume you followed these four guides and that you now have an Ubuntu 12.04 LAMP system running SAMBA/SWAT and a no-ip website address with port 80 forwarded to the Ubuntu server.

Create Media directories

Create your media directory and a download directory, and give them quite permissive permissions so anyone on your network can access the folders. Please note that you need to exchange <ubuntu username> for your own Ubuntu username.

sudo mkdir ~/music
sudo chmod 777 ~/music
sudo mkdir ~/downloads
sudo chmod 777 ~/downloads

The permissions are set with the 777 code, which means that anyone with access to the system can read, edit, and run the files contained within. You could set more restrictive permissions here, but I prefer easy access to my media folders.
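If you want to double-check what a mode like 777 actually sets, you can inspect a directory with stat. A quick throwaway demonstration (using a scratch directory rather than your real media folder):

```shell
# Demonstrate what chmod 777 sets, using a temporary scratch directory
demo=$(mktemp -d)
chmod 777 "$demo"
stat -c '%a' "$demo"   # prints 777: read/write/execute for owner, group, and others
rmdir "$demo"
```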

Set up Windows folder sharing using SWAT

In a web browser, log in to SWAT as the admin: go to http://<Ubuntu server hostname>:901 (for example, http://ampache:901). Now log in with user name “root” and your <root user password>.

Click on the shares box at the top. Now click “create share”.

Enter the following into their respective boxes. Please remember to exchange <ubuntu username> for your own Ubuntu username.

path:        /home/<ubuntu username>/music
valid users:    <ubuntu username>
read only:    no
available:    yes

Now click commit changes. Now click advanced and set all the “masks” to 0777.

Click “commit changes” again.

Now click basic and “create share”. Repeat the process for your “downloads” folder.
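Behind the scenes, SWAT writes these settings into /etc/samba/smb.conf. If you prefer editing the file directly, the share above corresponds to a stanza roughly like the following sketch — assuming you named the share “music”; the masks match the advanced settings from the previous step, and the placeholder is yours to fill in:

```
[music]
   path = /home/<ubuntu username>/music
   valid users = <ubuntu username>
   read only = no
   available = yes
   create mask = 0777
   directory mask = 0777
```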

You should now see your Ubuntu server on the network and have access to the two shared folders you created, provided you remember the <SAMBA user password> you set. Again, please note that this is quite permissive: it gives everyone on the network easy access to the music and downloads folders, provided they know the <SAMBA user password> when they try to access the folders.

Now you are ready to download and install Ampache. At this point I like to start copying my music over to the shared music folder, as this will take some time. Hopefully it will be done by the time I'm done installing Ampache.

Install Ampache

Download and unpack ampache

Go to your PuTTY terminal and enter:

cd ~/downloads

On the Ampache download page, right-click the latest tar.gz link, copy the link to the clipboard, and paste it into the terminal as in the following line (after typing “sudo wget”; then add the “-O ampache.tar.gz” part):

sudo wget -O ampache.tar.gz

Untar the tarball into an appropriate folder:

sudo mkdir /usr/local/src/www
sudo chmod 7777 /usr/local/src/www
sudo tar zxvf ampache.tar.gz -C /usr/local/src/www

Note the name of the root folder (for example, ampache-3.6-alpha6) and use the noted name where you see it in the following text.

Relax the permissions on the extracted folder:

 sudo chmod -R 7777 /usr/local/src/www/ampache-3.6-alpha6

For security's sake, once installation is complete, we will give ownership of the extracted folder to the web server and tighten permissions up again.

Enable php-gd in the Apache web server to allow resizing of album art:

sudo apt-get install php5-gd
sudo /etc/init.d/apache2 restart

Create a link from the web-server root folder to the extracted ampache site:

cd /var/www/
sudo ln -s /usr/local/src/www/ampache-3.6-alpha6 ampache

Doing it this way allows us to move websites around and rename them with ease.

Online initial ampache configuration

Go back to your web browser and go to: http://<Ubuntu server hostname>/ampache

If all went well so far, you should see the start of the installation process. Note that because you used an Ubuntu 12.04 LAMP server, all the little OKs are nice and green, indicating that the system is ready to have Ampache installed.

Click to start the configuration and fill in the boxes as follows

  • Desired Database Name        – ampache
  • MySQL Hostname             – localhost
  • something else I forget        – <leave blank>
  • MySQL Administrative Username     – root
  • MySQL Administrative Password     – <mySQL root password>
  • create database user for new database [check]
  • Ampache database username     – <ampache database user name> eg ampache
  • Ampache database User Password     – <ampache database password> eg ampachexyz123blahblah6htYd4
  • overwrite existing [unchecked]
  • use existing database [unchecked]

Here I recommend that you do NOT use your <ubuntu username> as the <ampache database user name> or your <ubuntu user password> as the <ampache database password>. Use a new username and password here: they will be stored in clear text inside your config file, and anyone reading them could gain control over the whole system if you reused your Ubuntu credentials.
You only need to remember the database username and password for the next step. They are for Ampache to use, not you.

Click “insert database” and then fill in the boxes for the next section as follows:

  • Web Path         – /ampache
  • Database Name         – ampache
  • MySQL Hostname         – localhost
  • MySQL port (optional)     – <leave empty>
  • MySQL Username         – <ampache database user name>     or root
  • MySQL Password         – <ampache database password> or mysql root password

Click “write”.

Note that the red words turn into green OK’s.

Click to continue to step 3.

Create the admin user and password:

  • Ampache admin username    – <ampache admin username>
  • Ampache admin password    – <ampache admin password>
  • Repeat the password

Here I usually use a password I can easily remember as I will need to use this to access the ampache site whenever I want to use it.

Click to update the database.

Click return.

The main Ampache login screen should open, and you can log in with the Ampache admin username and password you just set.
Don't forget this password, as it's quite a pain to reset.

Now you should see the main ampache interface. Ampache is up and running but needs some more specific configuration before you can use it.

Ampache user specific configuration

Back to the putty terminal.

First we’ll create a space to drop the error logs so you can locate any issues later:

sudo mkdir /var/log/ampache
sudo chmod 7777 /var/log/ampache

Now we’ll create a temp directory to store zip files for users to download songs, albums, and playlists.

sudo mkdir /ziptemp
sudo chmod 7777 /ziptemp

Now we will start editing the ampache config file:

sudo nano /usr/local/src/www/ampache-3.6-alpha6/config/ampache.cfg.php

Scroll slowly down through the file and edit each set of the appropriate lines to read as follows. I have titled the sets to help split up their use.
To speed things up, you can use Ctrl-W to search for specific config parameters.

Network access – allow access from offsite:

require_localnet_session = "false"
access_control  = "true"

Allow zip downloads of songs/albums/playlists etc:

allow_zip_download = "true"
file_zip_download = "true"
file_zip_path = "/ziptemp"
memory_limit = 128

Aesthetic improvement:

resize_images = "true"

Allow comprehensive debugging:

debug = "true"
debug_level = 5
log_path = "/var/log/ampache"

Transcoding settings

Arguably, transcoding is simultaneously the most powerful and most frustrating feature of an Ampache install, but please bear with me, because you'll be thankful you got it to work. I'm going to explain a little about transcoding, then give MY version of the configuration file lines as they are in my file, along with the commands you need to run to install the additional transcoding software that's required.

The way transcoding works is that Ampache takes the original file, which could be in any format, and feeds it to a transcoding program such as avconv (which needs to be installed separately), along with a selection of program arguments that depend on how the transcoding needs to be done and what format the final file needs to be in. The transcoding program then attempts to convert the file from one format to the other using the appropriate codec, which also needs to be separately installed.

Since the Ampache team has no control over the programs and codecs that carry this out, there is only minimal help available for the correct syntax of the transcoding commands. I think the idea is that you should read the manual that comes with each program. Anyway, if any of your music is stored in anything other than mp3, you will need to get transcoding to work.

The transcode command tells Ampache the main command to use to start the transcode process. The encode args are what Ampache adds to the transcode command in order to modify the output to suit the requested format. With the new HTML5 player, different browsers will request output in different formats: for example, Firefox will request ogg, and Chrome will request mp3. Hence your encode args lines must be correct for both requested formats if you expect the player to work on different systems.

At the time of writing, the latest version of Internet Explorer was not capable of using HTML5 players. The bug is at their end, but hopefully they will fix it soon. Until then you will need to use Firefox or Chrome to access the HTML5 player.

Here are the config lines for my transcode settings. You will need to find the lines in the config php file as before and edit them to look as follows:

max_bit_rate = 576
min_bit_rate = 48
transcode_flac = required
transcode_mp3 = allowed
encode_target = mp3
transcode_cmd = "avconv -i %FILE%"
encode_args_ogg = "-vn -b:a max\(%SAMPLE%K\,49K\) -acodec libvorbis -vcodec libtheora -f ogg pipe:1"
encode_args_mp3 = "-vn -b:a %SAMPLE%K -acodec libmp3lame -f mp3 pipe:1"
encode_args_ogv = "-vcodec libtheora -acodec libvorbis -ar 44100 -f ogv pipe:1"
encode_args_mp4 = "-profile:0 baseline -frag_duration 2 -ar 44100 -f mp4 pipe:1"
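To see how these lines fit together, here is a small sketch of the substitution Ampache performs: it replaces the %FILE% and %SAMPLE% placeholders in transcode_cmd and the matching encode_args line to build the final avconv command. The file path and bit rate below are made-up examples, not values from your config:

```shell
# Sketch of how Ampache expands its placeholders into an avconv command line.
transcode_cmd='avconv -i %FILE%'
encode_args_mp3='-vn -b:a %SAMPLE%K -acodec libmp3lame -f mp3 pipe:1'

file='/home/user/music/song.flac'   # hypothetical input file
sample=128                          # hypothetical target bit rate

cmd=$(printf '%s %s' "$transcode_cmd" "$encode_args_mp3" \
  | sed -e "s|%FILE%|$file|" -e "s|%SAMPLE%|$sample|")
echo "$cmd"
# → avconv -i /home/user/music/song.flac -vn -b:a 128K -acodec libmp3lame -f mp3 pipe:1
```

The echoed line is what actually gets run, with its output streamed to the player over pipe:1.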

I may come back and edit these later to improve video transcoding but for now they’ll do for music at least.

You may note the strange encode_args_ogg line. This is because, at the time of writing, the libvorbis library can't handle output bit rates less than 42K or so. For this reason I've added a bit to the line to force a bit rate of at least 49K.

OK, save and exit the config file, and we will install the missing programs and codec libraries.

If at any time you think you've ruined any particular config parameter, you can open and check the distribution config file.
Do this with the command:

sudo nano /usr/local/src/www/ampache-3.6-alpha6/config/ampache.cfg.php.dist

Transcoding software

Add the repository of multimedia libraries and programs:

sudo nano  /etc/apt/sources.list

Add the following two lines to your sources file:

deb /
deb-src /

Save and quit (Ctrl-X, then hit y).

Add the repository key:

wget -O - | sudo apt-key add -

Install the compiler and package programs:

sudo apt-get -y install build-essential checkinstall cvs subversion git git-core mercurial pkg-config apt-file

Install the audio libraries required to build the avconv audio transcoding:

sudo apt-get -y install lame libvorbis-dev vorbis-tools flac libmp3lame-dev libavcodec-extra*

Install the video libraries required for avconv video transcoding:

sudo apt-get install libfaac-dev libtheora-dev libvpx-dev

At this stage support for video cataloging and streaming in ampache is limited, so I have focused mostly on audio transcoding options, with a view to incorporating video transcoding when support within ampache is better. I've included some video components as I get them to work.

Now we're going to install the actual transcoding binary, avconv, which is preferred over ffmpeg in Ubuntu.
This part will seem pretty deep and complicated, but the process is typical for installing programs from source files in Ubuntu:
first you create a directory to store the source files, then you download the source files from the internet, then you configure the compiler with some settings, then you make the program and its libraries into a package, then you install the package.

Note that the "make" process is very time consuming (and scary looking), particularly for avconv. Just grab a cup of coffee and watch the fun unfold until the command line appears waiting for input from you again.
Note also that some online guides suggest "make install" rather than "checkinstall". In Ubuntu I prefer checkinstall, as it generates a simple uninstall process and optionally allows you to specify the package dependencies, enabling you to share the packages with others.

(required) Get the architecture optimised compiler yasm

sudo mkdir /usr/local/src/yasm
cd ~/downloads

You could also go to and check that there isn't a version newer than 1.2.0 if you so choose, and replace the link above with its link.

Unzip, configure, build, and install yasm:

tar xzvf yasm-1.2.0.tar.gz
sudo mv yasm-1.2.0 /usr/local/src/yasm/
cd /usr/local/src/yasm/yasm-1.2.0
sudo ./configure
sudo make
sudo checkinstall

Hit enter to select y to create the default docs.
Enter yasm when requested as the name for the installed program package.
Hit enter again and then again to start the process.

(optional) Get the x264 codec – required to enable on-the-fly video transcoding

sudo mkdir /usr/local/src/x264
cd /usr/local/src/
sudo git clone git:// x264
cd /usr/local/src/x264
sudo ./configure --enable-shared --libdir=/usr/local/lib/
sudo make
sudo checkinstall

Hit enter to select y and create the default docs.
Enter x264 when requested as the name for the install.
Hit enter to get to the values edit part.
Select 3 to change the version.
Call it version 1.
Then hit enter again and again to start the process.

(Required) Get the avconv transcoding program

cd /usr/local/src/
sudo git clone git:// avconv
cd /usr/local/src/avconv
sudo LD_LIBRARY_PATH=/usr/local/lib/ ./configure --enable-gpl --enable-libx264 --enable-nonfree --enable-shared --enable-libmp3lame --enable-libvorbis --enable-libtheora --enable-libfaac --enable-libvpx > ~/avconv_configuration.txt

Note the bunch of settings in the build configuration – the most important for audio transcoding are --enable-libmp3lame and --enable-libvorbis.
libmp3lame allows us to transcode to mp3, and libvorbis allows us to transcode to ogg, which is required for the Firefox HTML5 player.

Now we'll build avconv, which takes ages. I've added the switch -j14 to run multiple jobs during make. You may or may not have better luck with different values after the -j depending on your own CPU architecture; 14 was best for me on my dual-core hyperthreaded 8 GB RAM machine.
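If you'd rather not hard-code 14, here's a rough sketch for deriving a job count from your machine's CPU count (nproc is part of GNU coreutils; the +2 headroom is just a rule of thumb, not a tuned value):

```shell
# Derive a make job count from the number of available CPUs.
jobs=$(( $(nproc) + 2 ))
echo "make -j${jobs}"
```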

sudo make -j14
sudo checkinstall

Hit enter to select y to create the default docs.
Enter avconv when requested as the name for the installed program package.
Hit enter again and then again to start the process.

sudo ldconfig
sudo /etc/init.d/apache2 restart

Set up API/RPC access for external client program access

Next, in order that your iPhone, Android phone, or pimpache device can access the music, we need to set ACL/RPC and streaming permissions. NB: pimpache is not a typo but a "raspberry pi ampache client" project. It's currently under construction and located on GitHub under "PiPro".

Go to a browser and login to ampache.

First we’ll create an external user. In the web interface click the gray “admin” filing cabinet on the left.
Click the “add user” link.
Create a username such as <external user>.
Give it a GOOD password <external user password> but also remember that you may well be entering the password for it on your phone.
Set user access to “user”.
Click “add user”.
Click continue.

You can create users for different situations as well as different people. For example I have a “work” user with a lower default transcoding bit rate and a home user with a very high transcoding default bit rate.
Set the transcoding bitrate for a given user by going to the admin section (gray filing cabinet) then clicking “browse users” then the “preferences” icon.
The default transcoding bitrate is then set in the “transcode bitrate” section.
Logged-in users may also set their own default transcode bit rate by clicking the "preferences" icon in the menu tab, then streaming.

Now we’ll add the API/RPC so the external user can get at the catalog.

In the web interface click the gray “admin” filing cabinet on the left.
Click the “show acls” link.
Now click the “add API/RPC host” link.
Give it an appropriate name such as “external” (no quotes).

  • level “read/write”
  • user external
  • acl type API/RPC
  • start end

Click create ACL.
Click continue.
Now click the “add API/RPC host” link AGAIN.
Give it an appropriate name such as “external” (no quotes).

  • level “read/write”
  • user external
  • acl type API/RPC + Stream access
  • start end

Click create ACL.
You may get the message “DUPLICATE ACL DEFINED”.
That’s ok.
Now click show ACLS and you should see two listed called external with the settings above – one for API/RPC and one for streaming.
You may see 4; that's OK, as long as external (or all) can get streaming and API access.

Now you can install and use many and varied clients to attach to your ampache server and start enjoying its musical goodness.
NB: The phones, Viridian clients, etc. will ask for a username, password, and ampache URL. Above, you set up an external username and password to use in just such a situation, and the URL is the URL to your server from the outside world + /ampache.

If you want to use an internal network address you may need to specify the ip address rather than the server host name depending on your router DNS system.
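For example, if your server's LAN address were 192.168.1.50 (a made-up placeholder; on the server, hostname -I will list its real addresses), the client URL would be built like this:

```shell
# Build the client URL from an explicit LAN IP instead of the hostname.
server="192.168.1.50"   # placeholder: substitute your server's LAN IP
echo "http://${server}/ampache"
```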

Catalog  setup

Finally after all that your music will hopefully have finished copying over and you can create your ampache music catalog.

Click on the gray “admin” filing cabinet in your ampache website.
Click “add a catalog”.
Enter a simple name for your catalog, e.g. "home".
Enter the path to the music folder you've been copying your files across to, i.e. if you created a shared folder and copied your music to it as described above, the path is: "/home/<ubuntu username>/music"
(watch out for capital letters here, swap out the <ubuntu username> for the username you used, and don't use the quotes ").
Set the catalog type as local, but note that you could potentially chain other catalogs to yours – how cool is that.
If you keep your music organised like I do then leave the filename and folder patterns as is.
Click to “gather album art” and “build playlists” then click “add catalog”.
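Since a typo or wrong capitalisation in the path is the most common reason a new catalog comes up empty, a quick pre-check from the terminal can save a head-scratch. This is just a sketch; <ubuntu username> is the same placeholder as above:

```shell
# Check the catalog path exists exactly as typed (the test is case-sensitive,
# just like ampache's catalog path handling).
path="/home/<ubuntu username>/music"
if [ -d "$path" ]; then
  echo "ok: $path exists"
else
  echo "missing: $path (check spelling and capitalisation)"
fi
```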

Final security tidyup

Back in the PuTTY terminal, we'll restrict permissions on the extracted ampache folder to protect it from malicious software/people.

sudo chown www-data:www-data /usr/local/src/www/ampache-3.6-alpha6
sudo chmod -R 0700 /usr/local/src/www/ampache-3.6-alpha6/

That should do it.
If you need to move around in that directory again for some reason you will need to make the permissions more relaxed.
You can do this with

sudo chmod -R 0777 /usr/local/src/www/ampache-3.6-alpha6/

Don't forget to do

sudo chmod -R 0700 /usr/local/src/www/ampache-3.6-alpha6/

after to tighten it up again.
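The relax/tighten cycle is easy to verify with stat. Here it is demonstrated on a scratch directory so the commands are safe to try anywhere; swap in the ampache path for real use:

```shell
# Demonstrate the permission toggle and confirm the mode with stat.
dir=$(mktemp -d)
chmod 0777 "$dir"
stat -c '%a' "$dir"   # 777: relaxed for editing
chmod 0700 "$dir"
stat -c '%a' "$dir"   # 700: tightened again
rm -rf "$dir"
```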

Now, go and amaze your friends and family by streaming YOUR music to THEIR PC or phone.


The best help I can offer is to look inside the log file. If you followed the how-to above, you can find the log files at /var/log/ampache/.

cd /var/log/ampache

Note the latest file name then access the log file with (for example):

 sudo nano /var/log/ampache/yyyymmdd.ampache.log
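Because the logs are date-named (yyyymmdd), a plain sort picks out the newest one for you. The sketch below simulates this with a scratch directory; point logdir at /var/log/ampache on the real server:

```shell
# Find the newest date-named log file by sorting the directory listing.
logdir=$(mktemp -d)
touch "$logdir/20130101.ampache.log" "$logdir/20130315.ampache.log"
latest=$(ls -1 "$logdir" | sort | tail -1)
echo "$latest"   # prints 20130315.ampache.log
rm -rf "$logdir"
```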

Please, if you have any advice re: transcoding commands feel free to leave helpful comments. I think help on this is really hard to come by.

For starters the best info I can find for avconv in general is at

If you get permission errors when trying to copy to the music folder try again to relax the permissions on this folder with

sudo chmod 777 ~/music

Messed it up and want to start again from scratch?

Instead of reinstalling Ubuntu, LAMP, and SAMBA, you can delete ampache and its database with:

sudo mysql -u root -p

Enter your mysql password:

drop database ampache;
exit;


sudo rm -R /usr/local/src/www/ampache-3.6-alpha6
cd ~/downloads
sudo tar zxvf ampache.tar.gz -C /usr/local/src/www
sudo chmod -R 0777 /usr/local/src/www/ampache-3.6-alpha6

That should get you reset with all the default settings and ready to try again from the initial web logon.