Getting Started with Apache Flume NG: Flows, Agents and Syslog-to-HDFS Examples

Apache Flume NG replaced the original master/collector architecture with lightweight agents that can be wired together to form flexible data flows. This guide explains what changed with Flume NG, how the source–channel–sink model works, and walks through simple configurations for syslog ingestion to a console logger and to HDFS. It’s aimed at engineers who still operate Flume in legacy estates or need to understand it for migrations.

From Flume to Flume NG

Apache Flume is a distributed log and event collection service. With Flume NG, the project moved away from the original master/client and node/collector design and adopted a simpler, more robust architecture based on standalone agents.

Key changes introduced by Flume NG:

  • No external coordination service required for basic operation.
  • No master/client or node/collector roles—only agents.
  • Agents can be chained together to build arbitrary flows and fan-in/fan-out patterns.
  • Lightweight runtime; small heap sizes are sufficient for simple pipelines.
  • General-purpose exec source instead of dedicated tail/tailDir sources.

Requirements

To build Flume from source you need:

  • A JDK (e.g. 1.6+ in the original context; use a supported JDK for modern builds).
  • Maven 3.x.
  • Git or Subversion to fetch the source code.

Building Flume NG from Source

You can check out and build Flume using Git and Maven:

git clone git://git.apache.org/flume.git
cd flume
git checkout trunk
mvn clean
mvn package -DskipTests

After a successful build, the distribution artifacts are located under:

flume-ng-dist/target

Copy the desired distribution archive to the host where you want to run Flume, unpack it and you are ready to start configuring agents.

What Is a Flow in Flume NG?

A flow describes the full path of events from Source to Channel to Sink. Sinks can also feed into other agents, effectively becoming sources for downstream flows.

Conceptually, flows can look like this — fan-out within a single agent, or agents chained together (a sink feeding the source of a downstream agent):

# Fan-out: one source feeding two channel/sink pairs
source -> channel -> sink
      \-> channel -> sink

# Chained agents: a sink delivering to the next agent's source
source -> channel -> sink  =>  source -> channel -> sink

Flume NG runs one or more agents. Each agent hosts its own configured sources, channels and sinks.

The Configuration Model

Flume NG configuration is text-based and follows a logical pattern. For each agent, you declare:

  • A list of sources.
  • A list of channels.
  • A list of sinks.

The naming scheme is:

<agentName>.sources  = <sourceName1> <sourceName2> ...
<agentName>.channels = <channelName1> ...
<agentName>.sinks    = <sinkName1> ...

and for each component:

<agentName>.sources.<sourceName>.property = value
<agentName>.channels.<channelName>.property = value
<agentName>.sinks.<sinkName>.property = value

You are free to choose meaningful names for sources, channels and sinks; those names become the identifiers you wire together.
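To make the naming model concrete, here is a toy Python sketch (not Flume's actual parser) that groups such dotted property lines by agent, component kind and component name:

```python
def parse_flume_config(text):
    """Group '<agent>.<kind>.<name>.<prop> = value' lines into a nested dict.

    Toy illustration of Flume NG's naming scheme -- not Flume's real parser.
    """
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        parts = key.split(".")
        if len(parts) < 4:
            continue  # top-level lists like '<agent>.sources = ...' skipped here
        agent, kind, name = parts[0], parts[1], parts[2]
        prop = ".".join(parts[3:])  # properties may themselves be dotted (hdfs.path)
        config.setdefault(agent, {}).setdefault(kind, {}).setdefault(name, {})[prop] = value
    return config

sample = """
syslog-agent.sources.Syslog.type = syslogTcp
syslog-agent.sources.Syslog.port = 5140
"""
print(parse_flume_config(sample))
# {'syslog-agent': {'sources': {'Syslog': {'type': 'syslogTcp', 'port': '5140'}}}}
```

Note how a dotted property such as `hdfs.path` stays intact after the component name — only the first three segments identify the component.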

Example 1: Syslog to Console Logger

The following configuration (syslog-agent.cnf) defines a simple flow:

  • Source: receives syslog over TCP.
  • Channel: in-memory channel.
  • Sink: logger sink, prints events to stdout for debugging.

syslog-agent.sources  = Syslog
syslog-agent.channels = MemoryChannel-1
syslog-agent.sinks    = Console

# Source definition
syslog-agent.sources.Syslog.type = syslogTcp
syslog-agent.sources.Syslog.port = 5140

# Wiring
syslog-agent.sources.Syslog.channels = MemoryChannel-1
syslog-agent.sinks.Console.channel   = MemoryChannel-1

# Sink definition
syslog-agent.sinks.Console.type = logger

# Channel definition
syslog-agent.channels.MemoryChannel-1.type = memory

In this example, the agent syslog-agent listens on TCP port 5140 for syslog messages and writes every event to the console via the logger sink.
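To verify the flow end to end, you can fire a test event at the port. The snippet below is a small, hypothetical Python helper (the function names are mine, not Flume's) that builds a minimal RFC 3164-style syslog line and sends it over TCP:

```python
import socket
from datetime import datetime

def syslog_line(message, hostname="testhost", facility=1, severity=5):
    """Build a minimal RFC 3164-style syslog line: <PRI>timestamp host message."""
    pri = facility * 8 + severity  # facility 1 (user) + severity 5 (notice) -> <13>
    timestamp = datetime.now().strftime("%b %d %H:%M:%S")
    return f"<{pri}>{timestamp} {hostname} {message}\n"

def send_test_event(message, host="localhost", port=5140):
    """Send one event to the agent's syslogTcp source."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(syslog_line(message).encode("ascii"))
```

With the agent running, `send_test_event("hello flume")` should make the logger sink print the event; the command-line `logger` utility or `nc` against port 5140 works just as well.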

Example 2: Syslog to HDFS

To persist events into HDFS instead of logging to stdout, you can swap the sink to an HDFS sink:

syslog-agent.sources  = Syslog
syslog-agent.channels = MemoryChannel-1
syslog-agent.sinks    = HDFS-LAB

# Source definition
syslog-agent.sources.Syslog.type = syslogTcp
syslog-agent.sources.Syslog.port = 5140

# Wiring
syslog-agent.sources.Syslog.channels   = MemoryChannel-1
syslog-agent.sinks.HDFS-LAB.channel    = MemoryChannel-1

# HDFS sink definition
syslog-agent.sinks.HDFS-LAB.type            = hdfs
syslog-agent.sinks.HDFS-LAB.hdfs.path       = hdfs://NN.URI:PORT/flumetest/%{host}
syslog-agent.sinks.HDFS-LAB.hdfs.filePrefix = syslogfiles
syslog-agent.sinks.HDFS-LAB.hdfs.rollInterval = 60
syslog-agent.sinks.HDFS-LAB.hdfs.fileType   = SequenceFile

# Channel definition
syslog-agent.channels.MemoryChannel-1.type = memory

This configuration listens for syslog events and writes them into HDFS, rolling files every 60 seconds with the prefix syslogfiles.
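The `%{host}` escape in `hdfs.path` is expanded from the event's `host` header, which the syslog source populates from the sender. As a rough illustration of that substitution (a sketch, not Flume's actual implementation):

```python
import re

def resolve_escapes(path_template, headers):
    """Expand %{header} escapes against an event's headers, HDFS-sink style."""
    return re.sub(r"%\{(\w+)\}", lambda m: headers.get(m.group(1), ""), path_template)

print(resolve_escapes("/flumetest/%{host}", {"host": "web01"}))
# /flumetest/web01
```

Events from different hosts therefore land in per-host directories under `/flumetest`.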

Starting an Agent

Flume NG runs one agent per process. To start an agent with a specific configuration file:

bin/flume-ng agent -n YOUR_AGENT_NAME -f YOUR_CONFIG_FILE

For the syslog example:

bin/flume-ng agent -n syslog-agent -f conf/syslog-agent.cnf

Once started, the agent binds to the configured syslog port and begins routing events through the defined channel and sink. Note that later Flume 1.x releases also expect a configuration directory, e.g. bin/flume-ng agent --conf conf --conf-file conf/syslog-agent.cnf --name syslog-agent.
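A quick way to confirm the agent has actually bound its port is a plain TCP connect check — a minimal sketch, assuming the port 5140 default from the examples above:

```python
import socket

def agent_listening(host="localhost", port=5140, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False after startup, check the agent log for configuration errors (a mistyped component name in the wiring section is the usual culprit).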

Further Reading

If you need help with distributed systems, backend engineering, or data platforms, check my Services.