Monday, 21 November 2016

Keedio FTP Flume Source

Keedio-flume-ftp was created to meet the need to process information stored on an FTP server. Information is processed by Apache Flume, whose basic unit of information is an “event”.
Usually, on an FTP server, data is loaded in bulk, which is a completely different usage paradigm from the event-based one on which Flume relies.
A VERY BRIEF INTRODUCTION TO APACHE FLUME:
Apache Flume is a widely used, versatile and extremely extensible ingestion framework for Big Data deployments. At Keedio we try (as much as possible) to centralize data ingestion through Apache Flume.
An event, in Flume, is composed of two parts:
  • a header: contains event metadata that identifies the event.
  • the payload: the piece of information to be injected by the plugin.
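To make that anatomy concrete, here is a minimal sketch that builds an event with Flume's `EventBuilder`; the header keys are purely illustrative, not the ones keedio-flume-ftp actually sets:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventAnatomy {
    public static void main(String[] args) {
        // Header: metadata identifying the event (illustrative keys).
        Map<String, String> headers = new HashMap<>();
        headers.put("fileName", "/logs/app.log");
        headers.put("timestamp", String.valueOf(System.currentTimeMillis()));

        // Payload: the raw piece of information carried by the event.
        byte[] payload = "2016-11-21 10:00:01 INFO service started"
                .getBytes(StandardCharsets.UTF_8);

        Event event = EventBuilder.withBody(payload, headers);
        System.out.println(event.getHeaders() + " -> "
                + new String(event.getBody(), StandardCharsets.UTF_8));
    }
}
```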
A working Flume deployment is composed of at least three components (a minimal wiring example follows the list):
  • a source: the component responsible for receiving (or fetching) data from the data source (usually an external system).
  • a sink: the component responsible for storing the ingested data in another system (a file system, another process, etc.).
  • a channel: the link between the source and the sink.
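As a minimal wiring example, a Flume agent configuration in the standard properties format declares the three components and links them; the component names and values below are arbitrary:

```properties
# Declare one source, one channel and one sink for agent "agent1"
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = snk1

# Channel: in-memory buffer between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Source: where events enter (keedio-flume-ftp plugs in here; see below)
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Sink: where events leave (here, simply the agent's own log)
agent1.sinks.snk1.type = logger
agent1.sinks.snk1.channel = ch1
```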
Within this scheme, our `keedio-flume-ftp` plugin is a custom implementation of a “source” component, responsible for injecting data into the chain of aforementioned components. Keedio-flume-ftp takes into account the complexity of remote file changes (new files are added, files are changed, deleted or moved to other remote folders) and reacts accordingly.
how keedio ftp source works:
This plugin connects to an FTP server according to the specifications of the FTP standard (RFC 959).
If the security of the network connection is critical in your environment, `keedio-flume-ftp` allows you to secure the connection using cryptographic protocols like SSL and TLS, with or without certificate validation, or via SFTP, which tunnels file transfers through a single SSH connection.
If the TCP connection is successful, the plugin goes a step further and recursively searches for files beginning at the FTP root (subject, obviously, to the permission constraints imposed by the FTP server).
Each discovered file is treated as a data stream. How Flume “events” are generated is configurable and depends on how the ingested data will be processed. For example, the pattern of events generated for a binary file is not the same as the pattern of events for a text file containing logs.
Our plugin relies mainly on the amazing commons-net library. This dependency includes the classes required to connect to the remote server, providing an abstraction layer that allows access to remote files, and their properties, as if they were local.
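As an illustration of what commons-net makes possible (a sketch, not the plugin's actual code), a recursive traversal of the remote tree looks roughly like this; `FTPSClient` can be swapped in for a TLS-secured connection:

```java
import java.io.IOException;

import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPFile;

public class FtpWalk {
    // Recursively print every file reachable from 'path' on the remote server.
    static void walk(FTPClient ftp, String path) throws IOException {
        for (FTPFile entry : ftp.listFiles(path)) {
            String name = entry.getName();
            if (name.equals(".") || name.equals("..")) {
                continue;                       // avoid infinite recursion
            }
            String child = path.endsWith("/") ? path + name : path + "/" + name;
            if (entry.isDirectory()) {
                walk(ftp, child);               // descend into sub-folders
            } else if (entry.isFile()) {
                System.out.println(child + " (" + entry.getSize() + " bytes)");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        FTPClient ftp = new FTPClient();        // FTPSClient for FTP over TLS
        ftp.connect("ftp.example.com");         // placeholder host
        ftp.login("user", "password");          // placeholder credentials
        ftp.enterLocalPassiveMode();
        walk(ftp, "/");                         // start at the FTP root
        ftp.logout();
        ftp.disconnect();
    }
}
```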
information processing:
Our plugin enables Flume to connect to a remote FTP server hosting the information to be ingested as binary or text files.
A common use case is to ingest information from plain text files. These files usually contain records separated by a newline or some other specific character. Keedio-flume-ftp can be configured to read text files and generate events using a custom delimiter. In a log file, for example, the newline character is a good delimiter: in this simple use case keedio-flume-ftp will generate an event for each line in the file.
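A minimal sketch of that per-line behaviour, assuming a plain UTF-8 text stream (the method name is ours, not the plugin's):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class LineEvents {
    // Turn a text stream into one Flume event per line.
    static List<Event> eventsPerLine(InputStream in) throws IOException {
        List<Event> events = new ArrayList<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                events.add(EventBuilder.withBody(line.getBytes(StandardCharsets.UTF_8)));
            }
        }
        return events;
    }
}
```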
Another very common use case is data extraction from binary files, where our plugin will generate Flume events whose payload is composed of chunks of bytes of a configurable fixed size. Even though this behaviour is especially useful for binary files, it can also be applied to text files, as long as the generated events make sense to the downstream actor(s) processing the data.
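And a sketch of the chunked variant, again with an illustrative name and a caller-chosen chunk size:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class ChunkEvents {
    // Turn any stream into Flume events of at most 'chunkSize' bytes each.
    static List<Event> eventsPerChunk(InputStream in, int chunkSize) throws IOException {
        List<Event> events = new ArrayList<>();
        byte[] buffer = new byte[chunkSize];
        int read;
        while ((read = in.read(buffer)) != -1) {
            // The last chunk may be shorter than chunkSize; trim it.
            events.add(EventBuilder.withBody(Arrays.copyOf(buffer, read)));
        }
        return events;
    }
}
```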
This plugin assumes remote folders contain static or append-only files; the behaviour with files whose content changes randomly is unpredictable, and we strongly discourage using the plugin in that scenario.
tracking data:
keedio-flume-ftp has been designed both to discover new files and to detect changes in previously processed files, i.e. files to which new information has been appended. The discovery process is triggered at regular, user-configurable, intervals.
Keedio-flume-ftp periodically checkpoints metadata about processed files in a “status” file (the checkpoint interval is configurable). So, if the Flume agent dies unexpectedly (or is manually shut down), the processing of a given file is resumed from where it left off upon restart of the agent.
The processing paradigm is “at-least-once”, since the agent may die before having checkpointed a batch of events already sent to the underlying channel.
Changes to an already processed file are detected by comparing the current size of the file with the previous one (file sizes are included in the status file). For each file, one of the following might occur (see the sketch after this list):
  • File size is the same: the file has not changed, no need to process it.
  • The file size has increased: the plugin will start processing the file from the last offset stored in the status file.
  • The file size has decreased: the plugin treats the file as if it were new, starting to process it from offset zero.
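In code, the rule above boils down to something like this sketch (not the plugin's actual implementation; `statusFile` stands for a hypothetical in-memory view of the status file):

```java
import java.util.Map;

public class ChangeDetector {
    /** Returns the offset to resume from, or -1 if the file needs no processing. */
    static long resumeOffset(Map<String, Long> statusFile, String path, long currentSize) {
        Long lastSize = statusFile.get(path);
        if (lastSize == null) {
            return 0L;            // never seen before: process from the start
        } else if (currentSize == lastSize) {
            return -1L;           // unchanged: nothing to do
        } else if (currentSize > lastSize) {
            return lastSize;      // grown: resume from the last processed offset
        } else {
            return 0L;            // shrunk: treat as a new file, start from zero
        }
    }
}
```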
execution and configuration:
Deploying keedio-flume-ftp is as easy as copying the generated jar to the `plugins.d/lib` folder inside your Apache Flume installation root.
Take the time to read the project README for instructions on how to build and configure the plugin.
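To give a flavour of what a source block might look like, here is a hedged sketch in Flume's properties format; the `type` class name and the plugin-specific keys are from memory and purely illustrative, so double-check them against the README:

```properties
agent1.sources = ftpSrc

# The fully-qualified class name and the plugin-specific keys below are
# illustrative; check the project README for the real ones.
agent1.sources.ftpSrc.type = org.keedio.flume.source.ftp.source.Source
agent1.sources.ftpSrc.client.source = ftp
agent1.sources.ftpSrc.name.server = ftp.example.com
agent1.sources.ftpSrc.port = 21
agent1.sources.ftpSrc.user = flume
agent1.sources.ftpSrc.password = secret

# How often (in milliseconds) to re-scan the server for new/changed files
agent1.sources.ftpSrc.run.discover.delay = 10000

agent1.sources.ftpSrc.channels = ch1
```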
conclusions:
We gave a brief introduction to how our Flume FTP source works. You’re welcome to try it, and we would really appreciate any feedback you can give us!
This plugin is pretty stable and has been running in production in several banking environments for several months now. In a future post we will give a real-world usage example of the source under different conditions (FTP, SFTP, FTPS) and show how to monitor the source’s behaviour remotely using JMX.
Source: https://www.keedio.org/keedio-ftp-flume-source/
