Searching for English words in .bin files (or any language or file type really)

This is an exercise in procrastination. I should be working on another Genesis demo but have been very unmotivated recently. Instead I find myself curious about other things like "I wonder which PC Engine CD games have English text hidden somewhere in their code?"

Legend of Heroes II baited this curiosity. For unknown reasons there is a bunch of text from the English version of Ys II in it. Did I manage to hit the one and only game with something weird like this? That would be quite a lucky coincidence.

Finding blocks of ASCII text in a file is a trivial task. Finding something that is likely to be English is much more difficult. I studied artificial intelligence in graduate school and know enough to understand how challenging it is.

Graduate school was a long time ago. If I had tried to solve this problem back then it would have been a lot of work. GitHub didn't exist yet and even SourceForge was only a year old. I'd have been stuck trying to write an English parser myself. That would have been fun for like a month until I went mad. Today though I have many options.

Let's start with a library called lingua because it looks complete and easy to use. If it doesn't work we'll try something else.

First let's write the code I claimed was trivial, parsing ASCII strings:


public abstract class AsciiStringFinder{

 /**
  * Looks for blocks of consecutive ascii characters in a file.
  * Not optimized for memory use at all.
  * @param filePath full path to the file to search
  * @param minStringLength the minimum length a string needs to be before being added to the return map
  * @param acceptLetters whether to accept letters when searching (a-z,A-Z)
  * @param acceptNum whether to accept numbers when searching (0-9)
  * @param acceptSpecial whether to accept special characters when searching
  * @param acceptCustom array of bytes that should always be accepted - using this is slow, null is an acceptable value
  * @return a map in the form of <address,string found>
  */
 public final static Map<String,String> findInFile(String filePath,int minStringLength,boolean acceptLetters,boolean acceptNum,boolean acceptSpecial,byte[] acceptCustom){
  //LinkedHashMap keeps insertion order and the file is scanned front to back, so results come out in address order
  Map<String,String> map=new LinkedHashMap<String,String>();
  AsciiStringFinder.AcceptByte ab=new AsciiStringFinder.AcceptByte(acceptLetters,acceptNum,acceptSpecial,acceptCustom);
  try{
   byte[] f=Files.readAllBytes((new File(filePath)).toPath());
   int i=0;
   while(i<f.length){
    if(ab.acceptByte(f[i])){
     int end=i;
     while((end<f.length)&&(ab.acceptByte(f[end]))){
      end++;
     }
     if((end-i)>=minStringLength){
      byte[] sub=Arrays.copyOfRange(f,i,end);
      map.put("0x"+Integer.toHexString(i),new String(sub));
     }
     //skip past the whole run, even when it was too short to keep, so it isn't rescanned byte by byte
     i=end;
    }
    i++;
   }
  }catch(Exception x){
   x.printStackTrace();   
  } 
  return(map);
 }
 
 protected static class AcceptByte{
  final boolean acceptLetters;
  final boolean acceptNum;
  final boolean acceptSpecial;
  final byte[] acceptCustom;
  
  AcceptByte(boolean acceptLetters,boolean acceptNum,boolean acceptSpecial,byte[] acceptCustom){
   this.acceptLetters=acceptLetters;
   this.acceptNum=acceptNum;
   this.acceptSpecial=acceptSpecial;
   this.acceptCustom=acceptCustom;
  }

  //cheat sheet
  //0-31 = non-printing characters & symbols
  //32 = space (always accepted)
  //33-47 = special characters
  //48-57 = numbers
  //58-64 = more special characters
  //65-90 = uppercase letters
  //91-96 = more special characters
  //97-122 = lowercase letters
  //123-126 = yet more special characters
  //127 = delete
  protected boolean acceptByte(byte b){
   if(this.acceptLetters){
    if((b>=65)&&(b<=90)){return(true);}
    if((b>=97)&&(b<=122)){return(true);}
   }
   if(this.acceptNum){
    if((b>=48)&&(b<=57)){return(true);}
   }
   if(this.acceptSpecial){
    if((b>=33)&&(b<=47)){return(true);}
    if((b>=58)&&(b<=64)){return(true);}
    if((b>=91)&&(b<=96)){return(true);}
    if((b>=123)&&(b<=126)){return(true);}
   }
   if(b==32){return(true);}
   if(this.acceptCustom!=null){
    for(byte ac:this.acceptCustom){
     if(b==ac){return(true);}
    }
   }
   return(false);
  }
 }
}

The latest version of this particular code should be here: https://github.com/huguesjohnson/DubbelLib

You may be thinking "that doesn't look as trivial as you made it sound you weaselly liar" and you would be wrong. It's written to allow a caller maximum flexibility. Whether you care about letters, numbers, or special characters really depends on what you're trying to accomplish with the search. The byte[] acceptCustom part is to address searching files with known non-ASCII characters that appear in strings. If you need a nap I can explain how various non-ASCII characters are used to handle formatting in Phantasy Star III.
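If the acceptCustom loop ever became a bottleneck, one option is to precompute a 256-entry lookup table once and classify each byte with a single array read. This is only a sketch and not what DubbelLib does. The class name AcceptTableDemo and the 0xC5 control byte are made up for illustration, and the ranges follow the cheat sheet in the code above (so 91-96 count as special characters).

```java
public class AcceptTableDemo {
    // Builds a 256-entry table so each byte is classified with a single array read.
    public static boolean[] buildTable(boolean letters, boolean numbers, boolean special, byte[] custom) {
        boolean[] table = new boolean[256];
        table[32] = true; // space is always accepted
        for (int c = 33; c <= 126; c++) {
            boolean isLetter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            boolean isNumber = (c >= '0' && c <= '9');
            if (isLetter) {
                table[c] = letters;
            } else if (isNumber) {
                table[c] = numbers;
            } else {
                table[c] = special; // every other printable character counts as special
            }
        }
        if (custom != null) {
            for (byte b : custom) {
                table[b & 0xFF] = true; // mask because Java bytes are signed
            }
        }
        return table;
    }

    public static boolean accept(boolean[] table, byte b) {
        return table[b & 0xFF];
    }

    public static void main(String[] args) {
        // 0xC5 is a made-up control byte, purely for illustration
        boolean[] table = buildTable(true, false, false, new byte[]{(byte) 0xC5});
        System.out.println(accept(table, (byte) 'q'));
        System.out.println(accept(table, (byte) '7'));
        System.out.println(accept(table, (byte) 0xC5));
    }
}
```

The table costs 256 booleans per configuration, which is nothing next to reading a whole CD track into memory.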

This code is obviously not optimized for memory or speed. If you care about memory try replacing Files.readAllBytes with a streaming reader. Profiling acceptByte over a large data set would identify some improvements I'm sure. Adding an upfront check like if(((b<32)||(b>126))&&(this.acceptCustom==null)){return(false);} might speed it up, or it might make it slower. The grouping matters there because && binds tighter than ||. This is a thing I'm going to run once, maybe twice, so it's not important.
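For the curious, here is roughly what the streaming approach could look like. It's a minimal sketch under a few assumptions: the class name StreamingAsciiFinder is hypothetical, the accept check is a simplified stand-in (space, letters, and digits only), and the map is keyed by numeric offset rather than the "0x" strings used above. Only the current run is held in memory instead of the whole file.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class StreamingAsciiFinder {
    // Simplified stand-in for acceptByte: space, digits, and letters only.
    public static boolean accept(int b) {
        return b == 32 || (b >= '0' && b <= '9') || (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z');
    }

    // Returns offset -> string for every accepted run of at least minLength bytes.
    public static Map<Long, String> find(Path path, int minLength) throws IOException {
        Map<Long, String> found = new LinkedHashMap<>();
        StringBuilder run = new StringBuilder();
        long offset = 0;   // offset of the byte currently being read
        long runStart = 0; // offset where the current run began
        try (InputStream in = new BufferedInputStream(Files.newInputStream(path))) {
            int b;
            while ((b = in.read()) != -1) {
                if (accept(b)) {
                    if (run.length() == 0) {
                        runStart = offset;
                    }
                    run.append((char) b);
                } else {
                    if (run.length() >= minLength) {
                        found.put(runStart, run.toString());
                    }
                    run.setLength(0);
                }
                offset++;
            }
            // a run that reaches the end of the file still counts
            if (run.length() >= minLength) {
                found.put(runStart, run.toString());
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".bin");
        Files.write(tmp, new byte[]{0, 1, 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', (byte) 0xFF, 'h', 'i'});
        System.out.println(find(tmp, 10)); // prints {2=hello world}
        Files.delete(tmp);
    }
}
```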

This gets us a list of ASCII strings in an arbitrary file. Let's bring in lingua to see if we can find anything resembling English:


public static void main(String[] args){
 final String binPath="Dragon Slayer - The Legend of Heroes II (Japan) (Track 02).bin";
 final String csvPath="Dragon Slayer - The Legend of Heroes II (Japan) (Track 02).txt";
 LanguageDetector ld=LanguageDetectorBuilder.fromAllLanguagesWithLatinScript().build();
 Map<String,String> map=AsciiStringFinder.findInFile(binPath,10,true,true,false,null);
 FileWriter writer=null;
 try{
  writer=new FileWriter(csvPath);
  writer.write("address|detectedstring\n");
  for(Map.Entry<String,String> entry:map.entrySet()){
   String address=entry.getKey();
   String stringCandidate=entry.getValue();
   Map<Language,Double> confidence=ld.computeLanguageConfidenceValues(stringCandidate);
   Language l=ld.detectLanguageOf(stringCandidate);
   if(l.equals(Language.ENGLISH)){
    //default to 0 so a missing score fails the check below
    double score=confidence.getOrDefault(Language.ENGLISH,0d);
    if(score>0d){
     writer.write(address);
     writer.write("|");
     writer.write(stringCandidate);
     writer.write("\n");
    }
   }
  }
 }catch(Exception x){
  x.printStackTrace();
 }finally{
  try{if(writer!=null){writer.flush(); writer.close();}}catch(Exception x){ }
 }
 System.out.println("done and whatever");
}

fromAllLanguagesWithLatinScript() is a sneakily important part. LanguageDetector requires a minimum of two languages to work. If you set it up to detect only English and one other language you will get horrible results. I had it saying strings like "77u!;e9z" were English. Changing it to include all Latin script languages improved the accuracy for detecting English. It now said strings like "ffffffffffffffffffffffffffffffffffffff" were Welsh but perhaps that's true.

(I've never received hate mail in Welsh before so here's your opportunity to be the first.)

I picked a game that I knew had English text and sure enough lingua found it:


[...]
0x4fbcd1|he message engraved on
0x4fbce9|A slate has been worn thinthrough
0x4fbd3a|Wsix powerful books
0x4fbd67|he forces contained w
0x4fbeb3|A books will direct a braveman
0x4fbeeb|Fa locked door
0x4fbf09|hen all six books 
0x4fbf1e|returned to
[...]

That's a good sign. It still flagged some nonsense as English, but not too much. For example:


[...]
0x6dec87|PXPzzzzzzzz
[...]

That might be the name of someone my kids follow on TikTok though.

Let's see what's in some randomly selected games. Beyond Shadowgate apparently has some unused items among the other item descriptions:


[...]
0x33bca61|THIS IS A GREEN SEALED SCROLL
0x33bca7f|THIS IS A BLUE SEALED SCROLL
0x33bca9c|THIS IS A RED SEALED SCROLL
0x33bcab8|THIS IS A YELLOW SEALED SCROLL
0x33bcad7|OBJECT UNUSED 5
[...]
0x33bcb67|OBJECT UNUSED 6
[...]

The string below appears in a few games, which suggests the developers packaged up some DOS DLLs or EXEs:


[...]
0xdfb326|Copyright Microsoft Corp 1981
[...]

It Came from the Desert might have included a copy of DeluxePaint:


[...]
0x14fe1a14|Are you sure you want to
0x14fe1a6e|To use DeluxePaint interactively
0x14fe1aa8|mouse and driver must be installed
0x14fe1ace|Save changes to picture 
0x14fe1afa|creating new picture
0x14fe1b10|loading another picture
0x14fe1b2a|changing format
0x14fe1b4b|qcreate a custom brush
[...]

There's still a lot of garbage like this being detected:


NOAAAAAAAAAAAAAAAA
bcdefghhhghhhghhhghhhghhhi
PPPZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZX

The same letter repeated three or more times in a row is essentially never valid English but I get it. Lingua is simply saying that of all the languages it knows these are most likely English. If we want to slow things down a little more we can add a function to reject some strings.


//used to reject things that definitely aren't valid english
//rejects strings with no vowels and strings where one character repeats four or more times in a row
protected static boolean reject(String s){
 if((s==null)||(s.isEmpty())){return(true);}
 boolean hasV=false;
 char current=s.charAt(0);
 int count=0;//how many times current has repeated after its first appearance
 for(int i=0;i<s.length();i++){
  char c=s.charAt(i);
  //check every character, including the last one, for a vowel
  if((!hasV)&&("aeiouAEIOU".indexOf(c)>=0)){
   hasV=true;
  }
  if(i==0){continue;}
  if(current!=c){
   current=c;
   count=0;
  }else{
   count++;
   //count>2 fires on the fourth identical character in a row
   if(count>2){
    return(true);
   }
  }
 }
 return(!hasV);
}
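
The function above lets a run of exactly three identical characters through, which means something like "bcdefghhhghhhghhhghhhghhhi" survives. Here is a hypothetical variant, not the code used to produce the results in this article, that makes the run limit a parameter and only counts letters so strings padded with spaces aren't penalized. The class name RejectDemo and the maxRun parameter are assumptions for the sake of the example.

```java
public class RejectDemo {
    // Rejects s if any letter appears more than maxRun times in a row, or if s has no vowel.
    public static boolean reject(String s, int maxRun) {
        boolean hasVowel = false;
        int run = 0;
        char previous = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if ("aeiouAEIOU".indexOf(c) >= 0) {
                hasVowel = true;
            }
            // extend the current run or start a new one
            run = (i > 0 && c == previous) ? run + 1 : 1;
            if (Character.isLetter(c) && run > maxRun) {
                return true;
            }
            previous = c;
        }
        return !hasVowel;
    }

    public static void main(String[] args) {
        String[] samples = {
            "NOAAAAAAAAAAAAAAAA",
            "bcdefghhhghhhghhhghhhghhhi",
            "PXPzzzzzzzz",
            "   SOUND TEST   ",
            "A slate has been worn thinthrough"
        };
        for (String s : samples) {
            // maxRun=2 rejects three identical letters in a row
            System.out.println(reject(s, 2) + " <- [" + s + "]");
        }
    }
}
```

With maxRun set to 2 the first three samples are rejected and the last two are kept. Setting maxRun to 3 keeps the "ghhh" string, the same way the original function does.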

YAAAAAAAAAAAWN. This is kind of boring. I tried a couple games and found nothing interesting. Maybe I should try all of them and load it into a database.


sqlite3 pcetext.db
CREATE TABLE bins(id integer NOT NULL, name text NOT NULL);
CREATE TABLE bintext(id integer NOT NULL, binid integer NOT NULL, address TEXT, etext TEXT, FOREIGN KEY(binid) REFERENCES bins(id));

Now we'll write some code to scan .cue sheets for .bin files that are data tracks. I already had this sitting around to scan .bin files for audio tracks but needed to flip it around a little.


public class CueFileFilter implements FileFilter{
 @Override
 public boolean accept(File f){
  if(f.isDirectory()){
   return(true);
  } 
  return(f.getName().toLowerCase().endsWith(".cue"));
 }
}
[...]
//get a list of all .bin files that are data tracks
protected static ArrayList<String> getBinPaths(String rootDir){
 ArrayList<String> binPaths=new ArrayList<String>();
 BufferedReader cueReader=null;
 String basePath=null;
 String currentLine=null;
 String nextLine=null;
 try{
  ArrayList<File> cueFiles=FileUtils.getAllFilesRecursive(new File(rootDir),new CueFileFilter());
  for(File f:cueFiles){
   int lastIndex=f.getPath().lastIndexOf(File.separator);
   basePath=f.getPath().substring(0,lastIndex+1);
   cueReader=new BufferedReader(new InputStreamReader(new FileInputStream(f)));
   currentLine=null;
   while((currentLine=cueReader.readLine())!=null){
    if(currentLine.toUpperCase().startsWith("FILE")){
     nextLine=cueReader.readLine();
     //guard against a FILE entry on the last line of a truncated cue sheet
     if((nextLine!=null)&&(nextLine.toUpperCase().contains("MODE1"))){
      String fileName=currentLine.substring(currentLine.indexOf("FILE ")+5,currentLine.lastIndexOf(" BINARY")).replace("\"","");
      binPaths.add(basePath+fileName);
     }
    }
   }
   cueReader.close();
  }
 }catch(Exception x){
  x.printStackTrace();
  if(basePath!=null){System.err.println("basePath="+basePath);}
  if(currentLine!=null){System.err.println("currentLine="+currentLine);}
  if(nextLine!=null){System.err.println("nextLine="+nextLine);}
 }finally{
  try{if(cueReader!=null){cueReader.close();}}catch(Exception x){ }
 }
 return(binPaths);
}
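
For anyone who hasn't stared at a cue sheet before, each FILE line names a .bin and the TRACK line under it says whether that track is audio or data (MODE1). Here is a small self-contained sketch of the same idea using regular expressions on an in-memory cue sheet. The class name CueDataTracks and the sample file names are made up. Unlike the code above it keeps looking at every TRACK line under a FILE rather than only the next line, which is an assumption about what you'd want for single-bin rips.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CueDataTracks {
    // FILE "name.bin" BINARY (quotes optional)
    private static final Pattern FILE_LINE =
        Pattern.compile("^\\s*FILE\\s+\"?(.+?)\"?\\s+BINARY\\s*$", Pattern.CASE_INSENSITIVE);
    // TRACK 02 MODE1/2352 (group 1 captures the track type)
    private static final Pattern TRACK_LINE =
        Pattern.compile("^\\s*TRACK\\s+\\d+\\s+(\\S+)", Pattern.CASE_INSENSITIVE);

    // Returns the file names that have at least one MODE1 (data) track.
    public static List<String> dataTrackFiles(List<String> cueLines) {
        List<String> result = new ArrayList<>();
        String currentFile = null;
        for (String line : cueLines) {
            Matcher file = FILE_LINE.matcher(line);
            if (file.matches()) {
                currentFile = file.group(1);
                continue;
            }
            Matcher track = TRACK_LINE.matcher(line);
            if (track.find() && currentFile != null
                    && track.group(1).toUpperCase().startsWith("MODE1")
                    && !result.contains(currentFile)) {
                result.add(currentFile);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> cue = Arrays.asList(
            "FILE \"Game (Japan) (Track 01).bin\" BINARY",
            "  TRACK 01 AUDIO",
            "    INDEX 01 00:00:00",
            "FILE \"Game (Japan) (Track 02).bin\" BINARY",
            "  TRACK 02 MODE1/2352",
            "    INDEX 01 00:00:00");
        System.out.println(dataTrackFiles(cue)); // prints [Game (Japan) (Track 02).bin]
    }
}
```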

The main code now looks more like:


public static void main(String[] args){
 final String binRoot="/";
 final String connectionString="jdbc:sqlite:pcetext.db";
 StringBuffer query=new StringBuffer();
 LanguageDetector ld=LanguageDetectorBuilder.fromAllLanguagesWithLatinScript().build();
 Connection connection=null;
 try{
  connection=DriverManager.getConnection(connectionString);
  Statement statement=connection.createStatement();
  //let's start fresh
  statement.execute("DELETE from bins");
  statement.execute("DELETE from bintext");
  ArrayList<String> binPaths=getBinPaths(binRoot);
  int binTableId=0;
  int binTextTableId=0;
  int totalCount=binPaths.size();
  for(String binPath:binPaths){
   query=new StringBuffer();
   query.append("INSERT INTO bins values(");
   query.append(binTableId);
   query.append(",'");
   String binName=binPath.substring(binPath.lastIndexOf(File.separator)+1);
   query.append(binName.replace("'",""));
   query.append("');");
   statement.execute(query.toString());
   Map<String,String> map=AsciiStringFinder.findInFile(binPath,10,true,false,false,null);
   int mapSize=map.size();
   int counter=0;
   System.out.println("scraping ["+binTableId+"/"+totalCount+"]: "+binPath+" [max entries="+mapSize+"]");
   for(Map.Entry<String,String> entry:map.entrySet()){
    if(counter%1000==0){//track progress
     System.out.println("["+counter+"/"+mapSize+"]");
    }
    String address=entry.getKey();
    String stringCandidate=entry.getValue();
    Map<Language,Double> confidence=ld.computeLanguageConfidenceValues(stringCandidate);
    Language l=ld.detectLanguageOf(stringCandidate);
    if(l.equals(Language.ENGLISH)){
     //default to 0 so a missing score fails the check below
     double score=confidence.getOrDefault(Language.ENGLISH,0d);
     if((score>0d)&&(!reject(stringCandidate))){
      query=new StringBuffer();
      query.append("INSERT INTO bintext values(");
      query.append(binTextTableId);
      query.append(",");
      query.append(binTableId);
      query.append(",'");
      query.append(address);
      query.append("','");
      query.append(stringCandidate);
      query.append("');");
      try{
       statement.execute(query.toString());
       binTextTableId++;
      }catch(Exception qx){
       System.err.println("failed insert: "+query.toString());
      }
     }
    }
    counter++;
   }
   binTableId++;
  }
 }catch(Exception x){
  x.printStackTrace();
  System.err.println(query.toString());
 }finally{
  try{if(connection!=null){connection.close();}}catch(Exception x){ }
 }
 System.out.println("done and whatever");
}

Alright let's run that in the background for a while and see what we get...


select count(*) from bintext;
237652

Alright, that's going to take a while to read through. How about starting with something simple, like games that might have a hidden sound test:


select b.name, bt.address, bt.etext from bintext bt inner join bins as b on b.id=bt.binid where bt.etext like '%sound test%';
Babel (Japan) (Track 02).bin|0x12d4185| SOUND TEST 
Cosmic Fantasy 3 - Bouken Shounen Rei (Japan) (Track 02).bin|0x1b2d545| SOUND TEST 
Death Bringer - The Knight of Darkness (Japan) (Track 02).bin|0x51082a| SOUND TEST 
Fiend Hunter (Japan) (Track 02).bin|0x937832|FIEND HUNTER SOUND TEST
Gate of Thunder + Bonks Adventure + Bonks Revenge (USA) (Track 22).bin|0x8efc1|SOUND TEST
PC Engine Hyper Catalog CD-ROM 4 - 1993 Winter (Japan) (Track 52).bin|0xba4c1|SOUND TEST
PC Engine Hyper Catalog CD-ROM 3 - 1993 Summer (Japan) (Track 21).bin|0xba4c1|SOUND TEST
PC Engine Hyper Catalog CD-ROM 5 - 1994 Spring (Japan) (Track 25).bin|0xba4c1|SOUND TEST
PC Engine Hyper Catalog CD-ROM 6 - 1994 Summer (Japan) (Disc A) (Track 06).bin|0x93a41|SOUND TEST
Inoue Mami - Kono Hoshi ni Tatta Hitori no Kimi (Japan) (Track 2).bin|0xc7ffa51|SOUND TEST
Lady Phantom (Japan) (FABT) (Track 02).bin|0x2c6ce9| SOUND TEST 
Monster Lair (USA) (Track 02).bin|0x1238d1| SOUND TEST 
Moonlight Lady (Japan) (Track 02).bin|0x1dbb597| SOUND TEST 
Neo Nectaris (Japan) (Track 02).bin|0xcef81|SOUND TEST
Quiz Avenue III (Japan) (Track 2).bin|0xf35ca0|SOUND TEST
Rayxanber III (Japan) (Track 02).bin|0x2dd090|   SOUND TEST   
Spriggan Mark 2 - Re Terraform Project (Japan) (Track 02).bin|0x95fea|SOUND TEST
Super Mahjong Taikai (Japan) (Track 02).bin|0x927a9|  SOUND TEST   
Wonder Boy III - Monster Lair (Japan) (Rev 4) (Track 02).bin|0x1238d1| SOUND TEST 

Of course this doesn't mean the sound test screen is hidden or exists at all. These are merely games that are likely to have a sound test screen. If one wanted to enter the lucrative world of finding hidden game content this is a good start.

It's also now trivial to dump all the text to a csv and view in any spreadsheet program:



[Image: Dumping all the text to csv]

I'm really debating whether to post this spreadsheet with all the text ripped. If you ask nicely I would probably send it to you.

I'm saving the best part for last. It's a massive plot twist for this article and I'm kind of giddy about it...

A few months ago I found that for unknown reasons there was text from the English version of Ys II in Dragon Slayer: The Legend of Heroes II. I assumed that was maybe caused by some cross-contamination from the Hudson team working on Falcom ports. Oh, no, the full situation is much weirder.



[Image: More Ys text where it doesn't belong]

WHAT.. IS.. GOING.. ON.. HERE?

That is a full five things that are not Ys II that include a bunch of text from Ys II. I verified this against multiple rips including ones I made myself. Also none of these five games have US localizations.

Now one of these is the RPG sampler that includes a demo of one of the other games so we can ignore that. We're still left with:

This makes me suspect that some portion of Ys I&II was used as a project template. Maybe some of it was bundled up into a library that was linked by these other projects. It's still puzzling why it's the English version of the game. There must be something interesting in that version that is missing from the Japanese edition.

It's also possible this was due to AlfaSystem rather than Hudson. It looks like they worked on all these games in some capacity.

Alright, this article is already way past its initial goal. I wanted to figure out how to search for English text in arbitrary binary files. That's done. Along the way I found an exciting new mystery. The only debate is what do I tackle next?


