This is a follow-up to a previous article about displaying Cloudfront logs on a small screen connected to a Raspberry Pi. Apologies if I skim over some details covered there. This time around I'm running roughly the same queries but exporting them to an OpenDocument spreadsheet. There's a reason I suddenly want to do this...
I really don't care about how much traffic this site gets until it costs me money. Sometimes something here goes viral and I get a few million hits in a day. That costs me about an extra dollar for the whole month, hardly an issue. Lately though, incredibly inconsiderate AI scrapers have been a problem. They generate traffic that looks more like every page on this site going viral every day of the month. These AI companies seem to be throwing a lot of hardware at scraping. They could probably save a ton by using less hardware and more brain cells.
The first culprit was something called ClaudeBot, which I blocked. This is a very, very dumb bot. Very dumb. It always tries to access things over plain http first. Almost every site on the internet is https and every site worth scraping definitely is. It then gets an https redirect and downloads what it was after. Next time it goes for the same file it again tries plain http first. It is incapable of learning from previous mistakes, a thing a supposed AI company should do well. "Next time" was the important part though. It re-scans the same files thousands of times. As though an Electronics Boutique catalog scan from 1993 is changing every second. Did I mention how dumb ClaudeBot is? Lots of other bots scan this entire site all the time and I don't care. ClaudeBot does it in such an atrociously bad way that it more than doubled my hosting costs until I blocked it. I don't know who Claude is but he really needs to work on his bot-writing skills.
So now I care a lot about user agents that generate, I dunno, 100x more traffic than all the others combined. That means I need to run a daily report looking for potential problems. Since I want to run this every day it should be something I can stick in a .sh file and schedule to run. I debated between having it export to html or a spreadsheet. The SODS library looked like it handled spreadsheet creation well so that's what I'm trying.
The first modification from the previous article is putting all the configuration into a JSON file. The simplest, and least enterprise-y, way to accomplish this is creating a (de-facto) struct class that a library like Gson can read/write. Gson uses reflection to read/write values so typical Java getters & setters are pointless.
import java.util.Map;
import com.github.miachm.sods.Color;

//effectively a struct
public class CloudfrontReportSettings{
    public int maxQueryTime;
    public int threadSleep;
    public Color headerBGColor;
    public Color fontColor;
    public int fontSize;
    public String bucket;
    public String region;
    public String reportLocation;
    public Map<String,String> queryList;
}
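Since Gson pokes at the fields directly through reflection, the class needs no constructor, getters, or setters. A stripped-down, stdlib-only sketch of what that reflective field access looks like (the Settings class here is a hypothetical stand-in, not the real one above):

```java
import java.lang.reflect.Field;

public class ReflectionDemo{
    //a struct-style class like CloudfrontReportSettings - public fields, nothing else
    public static class Settings{
        public int fontSize=12;
        public String region="us-east-1";
    }

    public static void main(String[] args) throws Exception{
        Settings settings=new Settings();
        //roughly what Gson does under the hood: enumerate the fields and read values by name
        for(Field field:Settings.class.getFields()){
            System.out.println(field.getName()+"="+field.get(settings));
        }
    }
}
```

This is also why renaming a field quietly breaks old config files: the JSON keys are the field names.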
I'm just now questioning why the SODS library created its own implementation of Color instead of using java.awt.Color. Whatever, who cares. In JSON form, this configuration file looks like:
{
    "maxQueryTime":30000,
    "threadSleep":1000,
    "headerBGColor":{"red":212,"green":214,"blue":227},
    "fontColor":{"red":68,"green":77,"blue":87},
    "fontSize":12,
    "bucket":"s3://[BUCKET_NAME]/",
    "region":"[AWS_REGION]",
    "reportLocation":"CloudfrontReport.ods",
    "queryList":{
        "Top user agents":"select user_agent, count(*) as c from cloudfront_logs where status \u003e 199 and status \u003c 400 and user_agent \u003c\u003e \u0027-\u0027 and user_agent is not null group by user_agent order by c desc limit 100"
    }
}
This is just demoing "top 100 user agents" because I think that's easy to understand. The \u003e, \u003c, and \u0027 sequences are just Gson's default HTML-safe escaping of >, <, and ' - they parse back to the plain characters. I only keep Cloudfront logs for two days; if you keep them longer you'll probably want some query parameters to limit the date range.
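If you do keep logs around longer, a date predicate keeps Athena from scanning everything. A sketch, assuming the standard Cloudfront log table layout from the AWS docs, where the column is named date and has to be quoted because it's a reserved word:

```sql
select user_agent, count(*) as c
from cloudfront_logs
where "date" >= date_add('day', -7, current_date)
    and status > 199 and status < 400
    and user_agent <> '-' and user_agent is not null
group by user_agent
order by c desc
limit 100
```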
The code to pull logs is nearly identical to the previous article. CTRL+F "if((athenaClient!=null)&&(!fatalError))" to jump to the new code that creates a spreadsheet. The complete steps to this program go like:
1. Read the settings from the JSON file passed as the first argument
2. Create AWS credentials from the default profile
3. Create an Athena client
4. Run each query in the settings file and wait for it to finish
5. Copy each result set into its own sheet, formatting the first row as a header
6. Save the spreadsheet
We can debate the error handling and general program flow. This is a quick implementation that I will probably never touch again.
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Iterator;
import java.util.List;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.AmazonAthenaClient;
import com.amazonaws.services.athena.model.Datum;
import com.amazonaws.services.athena.model.GetQueryExecutionRequest;
import com.amazonaws.services.athena.model.GetQueryExecutionResult;
import com.amazonaws.services.athena.model.GetQueryResultsRequest;
import com.amazonaws.services.athena.model.GetQueryResultsResult;
import com.amazonaws.services.athena.model.ResultConfiguration;
import com.amazonaws.services.athena.model.Row;
import com.amazonaws.services.athena.model.StartQueryExecutionRequest;
import com.amazonaws.services.athena.model.StartQueryExecutionResult;
import com.github.miachm.sods.Sheet;
import com.github.miachm.sods.SpreadSheet;
import com.google.gson.Gson;
import com.huguesjohnson.dubbel.aws.GetQueryExecutionResultStatus;

public class CloudfrontReport{
    public static void main(String[] args) throws Exception{
        //used to terminate if something unrecoverable happens
        boolean fatalError=false;
        CloudfrontReportSettings settings=null;
        if(args.length<1){
            throw(new Exception("No settings file specified"));
        }else{
            System.out.println("Reading settings from: "+args[0]);
            String json=Files.readString(Paths.get(args[0]));
            settings=(new Gson()).fromJson(json,CloudfrontReportSettings.class);
        }
        AWSCredentials credentials=null;
        if(!fatalError){
            //setup aws credentials & athena client
            try{
                credentials=new ProfileCredentialsProvider("default").getCredentials();
                System.out.println("Created credentials");
            }catch(Exception x){
                x.printStackTrace();
                fatalError=true;
            }
        }
        AmazonAthena athenaClient=null;
        if((credentials!=null)&&(!fatalError)){
            try{
                athenaClient=AmazonAthenaClient.builder()
                    .withCredentials(new AWSStaticCredentialsProvider(credentials))
                    .withRegion(settings.region)
                    .build();
                System.out.println("Created AthenaClient");
            }catch(Exception x){
                x.printStackTrace();
                fatalError=true;
            }
        }
        if((athenaClient!=null)&&(!fatalError)){
            try{
                SpreadSheet spread=new SpreadSheet();
                for(String key:settings.queryList.keySet()){
                    List<Row> rs=runQuery(athenaClient,settings.queryList.get(key),settings);
                    int rows=rs.size();
                    if(rows>0){
                        int columns=rs.get(0).getData().size();
                        if(columns>0){
                            int rowNum=0;
                            Sheet sheet=new Sheet(key,rows,columns);
                            Iterator<Row> ir=rs.iterator();
                            while(ir.hasNext()){
                                int colNum=0;
                                Row r=ir.next();
                                List<Datum> ld=r.getData();
                                Iterator<Datum> id=ld.iterator();
                                while(id.hasNext()){
                                    //read the cell value directly rather than parsing Datum.toString()
                                    String d=id.next().getVarCharValue();
                                    sheet.getRange(rowNum,colNum).setValue((d==null)?"":d);
                                    colNum++;
                                }
                                rowNum++;
                            }
                            //format sheet - the first row of Athena results is the column headers
                            sheet.getRange(0,0,1,columns).setFontBold(true);
                            sheet.getRange(0,0,1,columns).setBackgroundColor(settings.headerBGColor);
                            sheet.getDataRange().setFontColor(settings.fontColor);
                            sheet.getDataRange().setFontSize(settings.fontSize);
                            spread.appendSheet(sheet);
                        }
                    }
                }
                spread.save(new File(settings.reportLocation));
                System.out.println("Saved to: "+settings.reportLocation);
            }catch(Exception outerLoopEx){
                outerLoopEx.printStackTrace();
                fatalError=true;
            }
        }
    }

    private static List<Row> runQuery(AmazonAthena athenaClient,String query,CloudfrontReportSettings settings) throws Exception{
        List<Row> rs=null;
        StartQueryExecutionRequest sqer=new StartQueryExecutionRequest();
        sqer.setQueryString(query);
        sqer.setSdkClientExecutionTimeout(settings.maxQueryTime);
        ResultConfiguration rc=new ResultConfiguration();
        rc.setOutputLocation(settings.bucket);
        sqer.setResultConfiguration(rc);
        //send the query off
        StartQueryExecutionResult sqResult=athenaClient.startQueryExecution(sqer);
        String exid=sqResult.getQueryExecutionId();
        //wait for the query to complete
        int runTime=settings.maxQueryTime; //belt and suspenders approach to timeout
        boolean queryRunning=true;
        GetQueryExecutionRequest qexr=new GetQueryExecutionRequest();
        qexr.setQueryExecutionId(exid);
        while((runTime>0)&&queryRunning){
            //get the status of the query execution
            GetQueryExecutionResult qxResult=athenaClient.getQueryExecution(qexr);
            String state=qxResult.getQueryExecution().getStatus().getState();
            if(state.equals(GetQueryExecutionResultStatus.STATUS_FAILED)||state.equals(GetQueryExecutionResultStatus.STATUS_CANCELLED)){
                System.out.println(qxResult.getQueryExecution().getStatus().getStateChangeReason());
                throw(new Exception("Query failed with status: "+state));
            }else if(state.equals(GetQueryExecutionResultStatus.STATUS_SUBMITTED)||state.equals(GetQueryExecutionResultStatus.STATUS_RUNNING)||state.equals(GetQueryExecutionResultStatus.STATUS_QUEUED)){
                runTime-=settings.threadSleep; //good enough place to do this
            }else if(state.equals(GetQueryExecutionResultStatus.STATUS_SUCCEEDED)){
                queryRunning=false;
            }else{
                throw(new Exception("Unknown status: "+state));
            }
            if(queryRunning){
                Thread.sleep(settings.threadSleep);
            }else{ //query completed successfully - fetch the results
                GetQueryResultsRequest gqrr=new GetQueryResultsRequest();
                gqrr.setQueryExecutionId(exid);
                GetQueryResultsResult gqResult=athenaClient.getQueryResults(gqrr);
                rs=gqResult.getResultSet().getRows();
            }
        }
        if(runTime<=0){ //<= because threadSleep may not divide maxQueryTime evenly
            throw(new Exception("Query took longer than "+settings.maxQueryTime+"ms to respond"));
        }
        return(rs);
    }
}
After packaging, this can be run from a script like:
echo "Launching CloudfrontReport.jar..."
java -jar CloudfrontReport.jar cloudfront-report.json
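To actually make it a daily report, a crontab entry can launch that script. This is just a config fragment with placeholder paths - adjust for wherever you put the jar and settings file:

```shell
# run the report every morning at 06:15 and keep a log of what happened
15 6 * * * cd /home/pi/cloudfront-report && ./run-report.sh >> report.log 2>&1
```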
Alright, this was a short one but I think I'm going to try and post more frequent programming articles for a little while.