
thehaiwave/some-code-scripting


some-code-scripting

I'll be putting my scripts here, since I always seem to lose them.

Scripts

  • separate_indicators.sh

This script breaks a GeoJSON file down into smaller, related files. In my case, each smaller file holds the GeoJSON objects whose codigo_act property is the same (as per my data's hierarchy), because 70M files are not fun. The script relies mostly on jq, which lets you reshape and move JSON around very easily.

I don't know if the script is efficient, to be honest. Probably not, but it does go through a 187,281-line file in about 30 seconds. Good enough for me.

Original file:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          -100.37162845,
          25.69593791
        ]
      },
      "properties": {
        "codigo_act": "236211"
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          -100.33016604,
          25.67235888
        ]
      },
      "properties": {
        "codigo_act": "118210"
      }
    }
  ]
}

Separated files:

// File 1
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          -100.37162845,
          25.69593791
        ]
      },
      "properties": {
        "codigo_act": "236211"
      }
    }
  ]
}
    
// File 2
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          -100.33016604,
          25.67235888
        ]
      },
      "properties": {
        "codigo_act": "118210"
      }
    }
  ]
}
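
For comparison, here is a rough Python sketch of the same split. It is not the jq-based script itself, just an illustration of the grouping step; the function name split_by_property is made up for this example, and it only prints a summary instead of writing one file per code.

```python
from collections import defaultdict


def split_by_property(collection, key="codigo_act"):
    """Group the features of a FeatureCollection by one of their properties."""
    groups = defaultdict(list)
    for feature in collection["features"]:
        groups[feature["properties"][key]].append(feature)
    # Wrap each group in its own FeatureCollection.
    return {
        value: {"type": "FeatureCollection", "features": features}
        for value, features in groups.items()
    }


original = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-100.37162845, 25.69593791]},
            "properties": {"codigo_act": "236211"},
        },
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-100.33016604, 25.67235888]},
            "properties": {"codigo_act": "118210"},
        },
    ],
}

separated = split_by_property(original)
for code, small_collection in separated.items():
    # The real script writes one file per code; here we only print a summary.
    print(code, len(small_collection["features"]))
```

Each value in separated is a complete FeatureCollection, so it could be dumped to its own file with json.dump.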
  • create_ind_objects.py

So here's the thing: the data that I'm working with is way too granular. I'm not going to manually type the name of each of the fields, no way. Also, the chosen fields are arbitrary, so I might add or delete a few, and I need the result in a certain format.

Anyway, this script is pretty simple. I give it a list of IDs corresponding to the data fields whose names I want, and out comes an array of JSON objects with both the name and the ID of each field. It's pretty disgusting how I did it, so I'm going to explain it. Basically, the data is divided like this:

      ┌─111
      │     ┌─1121
 ┌─11─┼─112─┴─1122
 │    └─113
 │          ┌─2311
─┼─23───231─┼─2312
 │          └─232
 ├─31───NULL
 │          ┌─4611
 └─46───461─┴─4612
 
  [0]   [1]    [2]

You get the idea. When the website that has the data begins to load, it makes an AJAX call to some external API. At the beginning, it only fetches the level 0 IDs. It is only when you click on one of them that the site fetches that ID's children (level 1), and when you click on a level 1 ID, it fetches that ID's children (level 2), and so on. The tricky part, then, is how to get, say, a level 3 ID. You could try to use something like curl or some PHP script to get the site's HTML, but that will only get you the markup, not the IDs that are loaded afterwards, so what can we do?

The answer lies in the URL endpoints the site uses. The initial endpoint looks like this:

https://www.inegi.org.mx/app/api/clasificadores/interna_v1/frontArbol/listaRamas/?proy=75&indica=0&_=1611671251717

We need to take note of a couple of things. First, it has an .mx TLD, so we should already be questioning the validity of the data. Second, it has a parameter called indica. That sure does sound a lot like the indicators (IDs) that we are looking for. Let's see what the URL looks like when we click on a level 0 ID, say '11':

https://www.inegi.org.mx/app/api/clasificadores/interna_v1/frontArbol/listaRamas/?proy=75&indica=11&_=1611671251718

OK, so it would seem that this parameter tells the endpoint which ID's children to load. However, notice that the very last parameter (_) also changed. In fact, it has increased by exactly one. My reading is that we need to provide both the ID whose children we want to see, and the depth, so to speak, at which those children are. (A caveat: the value looks like a Unix timestamp in milliseconds, and jQuery appends a _ cache-busting parameter to uncached AJAX requests and increments it on every call, so the depth reading may well be a coincidence.)
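
As a small sketch, the two URLs above can be rebuilt from their parts like this. The helper name children_url and the constant names are made up for this example; the values of proy, indica and _ are taken straight from the URLs shown, and nothing here actually calls the endpoint.

```python
from urllib.parse import urlencode

BASE_URL = (
    "https://www.inegi.org.mx/app/api/clasificadores/"
    "interna_v1/frontArbol/listaRamas/"
)
BASE_NONCE = 1611671251717  # value of `_` seen in the initial request


def children_url(parent_id, nonce):
    """Build the URL that lists the children of `parent_id`."""
    query = urlencode({"proy": 75, "indica": parent_id, "_": nonce})
    return f"{BASE_URL}?{query}"


root_url = children_url("0", BASE_NONCE)         # initial request
level0_url = children_url("11", BASE_NONCE + 1)  # after clicking on '11'
print(root_url)
print(level0_url)
```

The printed URLs should match the two shown above character for character.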

It's interesting to note that the IDs are organized in such a way that each sub-level adds exactly one more digit to the ID, so it's easy to figure out which level we should look up, given that our base level (in the URL) is 1611671251717. Only once you have the correct URL can you invoke curl to fetch the IDs. You then go through the response and move indexes around to see whether the ID you want is in there, or whether you should go deeper.
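
The "go deeper until you find it" walk can be sketched as follows. This is not the script's actual code. It assumes level 0 IDs have two digits and every level adds one digit, and it replaces the real curl call with a hypothetical fetch_children function backed by a hard-coded FAKE_TREE, because the real response format is not shown here.

```python
# Hypothetical stand-in for the API: maps an `indica` value to its child IDs.
FAKE_TREE = {
    "0": ["11", "23", "31", "46"],
    "11": ["111", "112", "113"],
    "112": ["1121", "1122"],
    "23": ["231"],
    "231": ["2311", "2312"],
    "46": ["461"],
    "461": ["4611", "4612"],
}


def fetch_children(parent_id):
    """Pretend to call the endpoint; the real script would use curl here."""
    return FAKE_TREE.get(parent_id, [])


def lookup_chain(target_id):
    """Return the `indica` values to request, in order, to reach `target_id`."""
    chain = ["0"]  # the initial request uses indica=0
    for length in range(2, len(target_id)):
        chain.append(target_id[:length])  # each ancestor is a prefix of the ID
    return chain


def find_id(target_id, fetch):
    """Walk down the tree and report whether `target_id` exists."""
    for depth, parent in enumerate(lookup_chain(target_id)):
        children = fetch(parent)
        if target_id in children:
            return True
        # To go deeper, the next ancestor must be among these children.
        if target_id[: depth + 2] not in children:
            return False
    return False


print(lookup_chain("1121"))
print(find_id("1121", fetch_children), find_id("1199", fetch_children))
```

Passing the fetch function in as an argument keeps the walk testable without touching the network.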

Come to think of it, this script has an issue. I used the zip() function because I knew that the IDs I was looking for existed. If one (or more) of the IDs you provide do NOT exist, the arrays fed to zip() won't have the same length. Will fix later.
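
A minimal illustration of why the mismatch matters, with made-up IDs and names and hypothetical "id" and "name" keys (the script's real output format may differ). Python's zip() does not complain about unequal lengths; it stops at the shorter input, so IDs can end up paired with the wrong names.

```python
wanted_ids = ["111", "999", "113"]            # "999" does not exist
found_names = ["Name of 111", "Name of 113"]  # so only two names come back

# zip() silently stops at the shorter list and pairs "999" with the wrong name.
broken = [{"id": i, "name": n} for i, n in zip(wanted_ids, found_names)]

# One possible fix: key the names by ID and skip the IDs that were not found.
names_by_id = {"111": "Name of 111", "113": "Name of 113"}
fixed = [
    {"id": i, "name": names_by_id[i]}
    for i in wanted_ids
    if i in names_by_id
]

print(broken)
print(fixed)
```

On Python 3.10 and later, zip(wanted_ids, found_names, strict=True) raises a ValueError instead of truncating, which at least makes the problem visible.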

  • parse_and_conquer.rb

Doesn't do much. It pulls data from some files and creates JSON objects with that data. The exact JSON structure can be changed.

Now, if you look at the code you'll notice that I have a variable that stores the JSON structure. I iterate through the data I want, fill that structure with my data, convert it to a real JSON object, and push the newly created object to an array. Originally, I didn't cast the structure variable as JSON, I just pushed it. This resulted in an array whose elements all referenced the same object; remember, I'm not copying the original JSON structure variable, I'm just changing it. The problem with this array, then, was that ALL of the elements would change whenever the referenced object was updated, so instead of having:

#FIRST ITERATION
some_array = [ obj1 ]


#SECOND ITERATION
some_array = [ obj1, obj2 ]

#THIRD ITERATION
some_array = [ obj1, obj2, obj3 ]

You end up with:

#FIRST ITERATION
some_array = [ obj1 ]


#SECOND ITERATION
some_array = [ obj2, obj2 ]

#THIRD ITERATION
some_array = [ obj3, obj3, obj3 ]
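
The same effect is easy to reproduce outside Ruby. Here is a short Python sketch, used only as an analogy for the Ruby script; the template variable and its "name" key are made up for the example.

```python
template = {"name": None}  # plays the role of the JSON structure variable

some_array = []
for value in ["obj1", "obj2", "obj3"]:
    template["name"] = value     # mutate the one and only template
    some_array.append(template)  # push a reference to it, not a copy

names = [item["name"] for item in some_array]
all_same_object = all(item is template for item in some_array)
print(names, all_same_object)
```

Every element of some_array is the very same object, so the last update shows up in all three slots.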

My first problem was that it took me more than 10 minutes to realize this was the issue. Once I knew what it was (thanks to some very nice S.O. people), the solution seemed straightforward: just duplicate the original JSON structure variable on each iteration. Ruby provides both #dup and #clone to copy objects. There are some differences between them, but the point is that I ended up duplicating the object like I said earlier.

Now, Ruby also lets you see the object_id of an object. So, naturally, you'd want to print that as the script is running so you can track the object references. Color me surprised when the object_ids of the elements that I pushed to the array were different, but the reference issue didn't go away. I honestly, truthfully, and frankly have no idea why that is. There has to be something about the way Ruby handles binding that I'm missing.

Casting the JSON structure as a JSON object does create a new object each time I push to the array, so that solves the issue. It is still not really ideal, though, because I then have to run some RegEx functions on the resulting object, since Ruby does some weird formatting things to it.
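
One possible explanation, which can't be confirmed without looking at the script: Ruby's #dup and #clone make shallow copies, so if the JSON structure has nested hashes or arrays, each copy gets a new outer object (hence a different ID) while still sharing the inner ones. The Python sketch below shows the same behaviour as an analogy; the nested "properties" key is an assumption about the structure, not something taken from the script.

```python
import copy
import json

template = {"properties": {"name": None}}  # nested, like many JSON structures

shallow, deep, via_json = [], [], []
for value in ["obj1", "obj2", "obj3"]:
    template["properties"]["name"] = value
    shallow.append(copy.copy(template))   # new outer dict, shared inner dict
    deep.append(copy.deepcopy(template))  # fully independent copy
    # A JSON round trip also rebuilds every nested object from scratch.
    via_json.append(json.loads(json.dumps(template)))

shallow_names = [x["properties"]["name"] for x in shallow]
deep_names = [x["properties"]["name"] for x in deep]
json_names = [x["properties"]["name"] for x in via_json]
distinct_outer_ids = len({id(x) for x in shallow})

print(distinct_outer_ids, shallow_names)
print(deep_names, json_names)
```

The shallow copies are three distinct objects, yet they all show the last value. That would match the different IDs with the reference issue still present, and it would also explain why converting to JSON fixes it, since serializing copies everything.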

About

An album by Perl Sweatshirt
