In a recent article I talked about the code I use to update & publish this site. Very exciting, I know. This one is about adding a new feature to that publisher. It's also about the trade-offs in writing code: when should code be generic and when should it be highly focused on a single problem?
The generic problem is finding duplicate lines in two files. If you work as a developer you get broad requirements like this all the time. "Find duplicate lines in two files" sounds super simple until you start asking questions:
Asking these questions back to the requester doesn't always work. I find it's better to ask "what are you planning to do with this information?" That should get you to a description of what they actually want to accomplish.
So what am I trying to accomplish?
I run the Pi-hole network ad blocker at home. I highly recommend it. Pi-hole's default block list is this one: raw.githubusercontent.com/StevenBlack/hosts/. Great list and everything, but sometimes I run into things it misses. I keep my additions here: huguesjohnson.com/pihole.txt. Pi-hole can consume multiple lists, so every time it updates it pulls the latest version of both. Hooray.
For my web publishing tool, I want to check my Pi-hole additions against the default list. If there are duplicates I'll remove them from my list manually later; it's not an urgent problem. I'd also like to know about duplicate entries within my own list, which is also not urgent to fix.
Going back to the previous design questions:
The general problem can be solved in a few ways. The two I'm thinking of are:
1) Go line-by-line through one file and search the other file for each line. This sounds like a bad way to go about it (there's a quick sketch of it after this list, just for comparison). The code is simple, the execution is awful. This is going to be very slow. One file is scanned exactly once. The other is (at least partially) scanned once for every line in the first file. If there are very few hits then the second file gets fully scanned a lot. It's been over 20 years since I thought about this, but that works out to O(n·m) for files of n and m lines - roughly O(n²) when the files are about the same size.
2) Sort the lines in each file alphabetically and walk through them in order. This is also not going to be blazing fast but it ensures we scan each file exactly once. The initial sort is time consuming but overall it's a better trade-off. It means keeping both files in memory, which could be bad in some cases. There are ways to work around that. Again, it comes down to the requirements. The walk itself is a much simpler O(n+m) - each list gets scanned once - not counting the sort complexity.
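Just to make the trade-off concrete, here's the rough sketch of option 1 mentioned above. This isn't the code I ended up using; the name and structure are only for illustration:

//option 1: list2 gets scanned (up to) once for every line in list1,
//which is where the O(n*m) comes from
public static ArrayList<String> findDuplicateRowsNaive(List<String> list1,List<String> list2){
	ArrayList<String> duplicates=new ArrayList<String>();
	for(String line:list1){
		if(list2.contains(line)){
			duplicates.add(line);
		}
	}
	return(duplicates);
}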
So I'm going to solve it with the second option. Even then there are two ways to do it: a general version and one tuned to this specific problem. We'll start with the more general version.
Before we get to that, let's start with the simple part - reading a file from the internet into an ArrayList of lines:
//reads a URL into an ArrayList of lines
public static ArrayList<String> readLines(URL url) throws IOException{
	ArrayList<String> lines=new ArrayList<String>();
	URLConnection connection=url.openConnection();
	try(BufferedReader br=new BufferedReader(new InputStreamReader(connection.getInputStream()))){
		String line;
		while((line=br.readLine())!=null){
			lines.add(line);
		}
	}
	return(lines);
}
That's all there is. Your mileage may vary depending on the site being read. Web servers (firewalls or Apache rules really) might not like the default Java user agent. Or if you run this hundreds of times maybe they'll stop liking your IP address.
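If that ever becomes a problem, URLConnection can send a different user agent - something like this, where the user agent string is just a made-up example:

URLConnection connection=url.openConnection();
//override the default "Java/x.x.x" user agent before reading from the connection
connection.setRequestProperty("User-Agent","Mozilla/5.0 (compatible; SitePublisher/1.0)");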
We just need a quick wrapper to return a sorted version. This is a separate method because in most cases there's not a good reason to sort the lines of a web page.
//readLines but sorting the results
public static ArrayList<String> readLines(URL url,Comparator<String> sortComparator) throws IOException{
ArrayList<String> linesSorted=readLines(url);
linesSorted.sort(sortComparator);
return(linesSorted);
}
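Later in this article I pass in a StringComparator; for reading along, assume it's essentially a plain lexicographic comparator like this sketch (the real class may differ):

//a stand-in for the StringComparator used later in this article
//(assumed: plain case-sensitive lexicographic comparison)
public class StringComparator implements Comparator<String>{
	@Override
	public int compare(String s1,String s2){
		return(s1.compareTo(s2));
	}
}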
Let's walk through the algorithm in a general sense. We start with two unsorted lists:
After sorting them we get:
Now we point to the first element in each list and compare them:
The 1st entry in list 1 is alphabetically before the 1st entry in list 2. So we move down list 1 and hit a match right away:
After that we advance in both lists:
Now we keep advancing in list 2 until it is ahead (alphabetically) of list 1:
Then repeat but for list 1, which only advances 1 index:
In the next round the index of list 2 jumps a few spots again:
After incrementing the index of list 1 there's another match:
So both indexes are incremented again:
And keep advancing through the lists:
Until another match is found:
I don't need to include every step from this point, but I will just to see it through to the end:
We've reached the end of list 1:
And now the end of list 2:
Since this application is trying to solve a specific problem, we could slightly speed up the search by scrubbing values that aren't useful. Here's what the initial lists would look like:
After that the search works like before but with fewer rows to look at and shorter entry strings to compare:
And so on...
It's all simply a trade-off in what we want to accomplish.
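I'm not going to walk through the scrubbing code itself, but it might look something like this sketch - assuming comment lines start with "#", blank lines get dropped, and block-list entries carry a leading address like "0.0.0.0 " that gets stripped so only the host name is compared:

public static ArrayList<String> scrubLines(List<String> lines){
	ArrayList<String> scrubbed=new ArrayList<String>();
	for(String line:lines){
		String s=line.trim();
		if(s.isEmpty()||s.startsWith("#")){continue;} //skip blanks and comments
		int space=s.indexOf(' ');
		if(space>0){s=s.substring(space+1).trim();} //"0.0.0.0 example.com" -> "example.com"
		scrubbed.add(s);
	}
	return(scrubbed); //sort the result before searching, same as before
}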
The code to search both lists turned out to be simple:
/*
* returns duplicate entries in two lists
* assumes both lists are sorted, if they are not this will fail
* this is case sensitive but it would be trivial to add a parameter to ignore case
*/
public static ArrayList<String> findDuplicateRows(List<String> list1,List<String> list2){
ArrayList<String> duplicates=new ArrayList<String>();
int size1=list1.size();
int size2=list2.size();
int index1=0;
int index2=0;
while((index1<size1)&&(index2<size2)){
String item1=list1.get(index1);
String item2=list2.get(index2);
int compare=item1.compareTo(item2);
if(compare==0){
duplicates.add(item1);
index1++;
index2++;
}else if(compare<0){
index1++;
}else{//compare>0
index2++;
}
}
return(duplicates);
}
The unit tests are on GitHub. They are too long and uninteresting to post here.
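Here's a tiny throwaway example of what the method returns, with made-up host names (this is not one of the real tests):

//both lists are already sorted, so this prints [banana.example, cherry.example]
List<String> list1=Arrays.asList("apple.example","banana.example","cherry.example");
List<String> list2=Arrays.asList("banana.example","cherry.example","grape.example");
System.out.println(findDuplicateRows(list1,list2));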
Finally, some stub code to run the compare and show the results.
ArrayList<String> a1=FileUtils.readLines((new URI("https://huguesjohnson.com/pihole.txt")).toURL(),new StringComparator());
ArrayList<String> a2=FileUtils.readLines((new URI("https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts")).toURL(),new StringComparator());
ArrayList<String> duplicates=ArrayUtil.findDuplicateRows(a1,a2);
if(duplicates.size()<1){
//log no duplicate entries found here
}else{
for(String s:duplicates){
if(!s.startsWith("#")){
//log the duplicate entry here
}
}
}
Alright, this was simple in the end. I spent more time on the diagrams than the code.