Http_poller input only new lines?

I’m using the http_poller Logstash input plugin to ingest a logfile into Elasticsearch. But every time it polls data from the logfile, it polls the whole file. Config file:

input {
  http_poller {
    urls => {
      test => {
        method => get
        url => ""
        headers => {
          "Accept" => "application/json"
          "x-xx-api" => "xxxxx"
        }
      }
    }
    request_timeout => 20
    # Supports "cron", "every", "at" and "in" schedules by rufus scheduler
    schedule => { cron => "* * * * * UTC" }
    codec => "json_lines"
    # A hash of request metadata info (timing, response headers, etc.) will be sent here
    metadata_target => "http_poller_metadata"
  }
}

output {
  elasticsearch {
    hosts => [ "" ]
    index => "xx-testing-%{+YYYY.MM}"
  }
  stdout {
    codec => rubydebug
  }
}
Log file looks like this:

{"@message":"Successful api request","@timestamp":"2018-01-11T10:11:00.260Z","@fields":{"origin":"xx.xx.xx.xx","environment":"production_beta","label":"askquestiongui","level":"info"}}
{"@message":"Successful api request","@timestamp":"2018-01-11T10:12:00.317Z","@fields":{"origin":"xx.xx.xx.xx","environment":"production_beta","label":"askquestiongui","level":"info"}}

If I use the "json" codec, I only get the first log line once; the "json_lines" codec writes the complete logfile to Elasticsearch each time. Please advise. :slight_smile:

Why not use Filebeat to produce events based on your log and send them directly to Elasticsearch?


Doesn’t Filebeat have to run on the server where the log file is? I cannot install anything on the server where the log file is located.

Another issue for me is that I have to pull the log file from the server, which is public, to my Elasticsearch server on a private network. I cannot use Filebeat to push the data to Elasticsearch.

Your HTTP server will need to be stateful.
Say you use a query string parameter: the server, having remembered that it served lines 0 to 99 in the previous call, serves lines 100 to 199 on this call.
The http_poller input is not stateful and has no facility to remember what the last processed line number was and adjust the query string for example.
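Since the poller cannot remember an offset, the server side has to. A minimal Ruby sketch (class and method names are hypothetical, not part of any real server) of the bookkeeping such an endpoint would need:

```ruby
# Hypothetical sketch: the server remembers how many lines it has
# already handed out, and each poll returns only the new ones.
class LogTail
  def initialize(lines)
    @lines = lines   # in reality, read from the log file on each request
    @offset = 0      # number of lines already served
  end

  # Called once per http_poller request.
  def next_batch
    batch = @lines[@offset..-1] || []
    @offset = @lines.length
    batch
  end
end

tail = LogTail.new(["line 1", "line 2"])
tail.next_batch  # => ["line 1", "line 2"]
tail.next_batch  # => []  (nothing new since the last poll)
```

In practice the offset would have to survive restarts (file, database) and be tracked per client, which is exactly the state the http_poller input cannot hold for you.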

I think I got it right now. I added:

        document_id => "%{@timestamp}"

…to my elasticsearch output. With that, Elasticsearch no longer duplicates documents that share a document_id; before, every new poll gave each reading a fresh, unique document_id.
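This works because Elasticsearch indexes by _id: re-sending a document with the same id overwrites it instead of creating a new one. A Ruby Hash models the same upsert semantics (the data below is illustrative, borrowed from the log sample above):

```ruby
# Sketch: a deterministic document_id turns repeated indexing into an
# overwrite, which is what absorbs the re-polled duplicates.
index = {}

docs = [
  { "@timestamp" => "2018-01-11T10:11:00.260Z", "@message" => "Successful api request" },
  { "@timestamp" => "2018-01-11T10:11:00.260Z", "@message" => "Successful api request" }, # re-polled duplicate
  { "@timestamp" => "2018-01-11T10:12:00.317Z", "@message" => "Successful api request" }
]

# Using @timestamp as the id, as in the elasticsearch output above.
docs.each { |doc| index[doc["@timestamp"]] = doc }

index.size  # => 2, the duplicate was absorbed
```

One caveat: two genuinely different log lines with the same @timestamp would also collide and overwrite each other, so a fingerprint of the whole line is a safer id if that can happen in your logs.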

Consider a different architecture.
Put Filebeat in the public zone on the server, Logstash in the DMZ and ES in the private zone.

We are aiming at having Elastic on a public server, but I cannot use Filebeat since we cannot install anything in the environment where the logfiles are located. However, our logging system uses the Winston library, which can send logging messages directly to Logstash, so once I get a public server running, I think that may be an excellent way to go. :slight_smile:

I just read the Winston docs and some of the code. It looks like it will try to dispatch the log line string to a destination immediately. The HTTP transport is acting as a client not a server AFAICT.
I don’t see how you are achieving persistence - via a Winston File transport? If so, then the file is a persistent buffer. In that case, what tech does the LS http_poller connect to in order to retrieve the log lines from those files?
I ask these questions not out of malice or because I doubt your solution but because I and others here can get to appreciate an alternative method to ship log lines from the edge.
Regarding Elasticsearch clusters in the public zone: if you have not already done so, you must secure it.
Regarding your future plans.

But our logging system uses the Winston library which can send logging messages directly to Logstash
By this I think you mean the Winston HTTP transport (client mode) to the LS http input (server mode). If so, there is a problem with buffering: LS will have to be up 24/7. How does the Winston client transport behave when the HTTP server is not available? Consider a load balancer between Winston and two or three LS instances (haproxy or nginx). If you do, remember that consecutive log lines will be sent to different LS instances, so there is no ordering.

Thanks for the input. :slight_smile:

We have a few limitations in this project that currently cause us some problems, but I think the method I’m using is a fairly good way to overcome those issues: having Logstash use the http_poller input with a private API key to fetch the data. I don’t see much difference (performance-wise) between having Logstash pull data from the server and having the log server push the data to the Elastic server. Right now this setup will only run for a few weeks as a proof of concept. If we launch it properly, we will have to scale everything a lot anyway. :wink:

However, you did not answer the question: what tech are you using to serve the requests from the http_poller?

We are running an application in IBM Cloud, and it’s a little restricted what you can and cannot do there. :slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.