I wanted to post a new entry following up on the topic of scraping, but I am just too eager to share the power of vim with you. Besides, what you learn in this post may be helpful on your own scraping adventures.
Scraping the web is fun, but sometimes we end up with data that is not formatted in any way we can use. If you don’t plan ahead, you may end up with files containing data that needs further processing before you can do anything useful with it. I was in such a position after my very first scraping adventure. I ended up with files full of data, but there was no easy way to use that data without processing it a bit first. Think of a file that has data like this:
Triple T Autobody & Paint 74-H Hamilton Drive Novato, CA 94949 (415)
883-2041 Directions (~9.55 miles)
Bay Area Frame 2218 Market Street San Pablo, CA 94806 (510) 233-1448
www.bafautobody.com Directions (~13.00 miles)
Stewart's Body Shop 12540 San Pablo Avenue Richmond, CA 94805 (510)
235-3515 Directions (~14.31 miles)
Bavarian Professionals 1218 7th Street Berkeley, CA 94710 (510) 524-6000
www.bavpros.com Directions (~16.58 miles)
As you can see, we have information about body shops, but that information is broken into two lines for each body shop, and not in the same place each time. I noticed that the lines all broke at around the same character count, although I’m not sure why. What I wanted to do was to arrange the data like this:
But there is no easy way to do it, since the lines are broken and not in the same place. I knew the first step would be to put each set of information on its own line.
It is worth noting here that my first two files had a different format. For some reason they were formatted something like this:
Address State Zip
Directions (xx.xx miles)
Again, I’m not sure why that was the case. The fact that the method I used to convert the first format into single-line sets of information also fit this second format was a happy accident.
So we want to put every set of information on its own line, but how? Well, the first thing you need to do is search for a pattern that you can use. In this case I noticed that all the information sets ended with (~xx.xx miles). That was all I knew.
I was dealing with 5 files, each of which contained about 15,000 sets of information. That makes two things clear:
1) Manual work will not be an option here.
2) I should assume that somewhere in those files there is a set that is broken into more than two lines.
Things were not as simple as they seemed at first, but at least I knew there was a pattern. I took advantage of that pattern and created a simple macro in vim.
The macro instructions, saved in register a, were simple. I started at the first line: place a mark at the beginning of the line, search for the next line that ends in “miles)”, enter visual line mode with “V”, go back to the mark previously set, hit “J” to join the lines, and move on to the next line.
That was all I had to do. Then I ran the macro as many times as needed, usually starting with 10,000 runs (10000@a) and then fewer each time, repeating for every file.
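In vim terms, recording that into register a looks something like this. This is a sketch with annotations next to each keystroke, not a script to source, and it assumes every record spans at least two lines, since “J” on a single-line selection would join it into the next record:

```vim
qa         start recording into register a
ma         set mark a at the start of the record
/miles)$   search (then press Enter) for the line that ends the record
V          enter visual line mode
'a         extend the selection back to the mark
J          join the selected lines into one
j          move down to the start of the next record
q          stop recording
```

One caveat: with vim’s default 'wrapscan' setting the search never fails, so a run that is too long can start joining already-finished records together. Setting :set nowrapscan first makes the macro abort cleanly once the search runs past the end of the file.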
Now that I had every set of information on its own line in each file, it was time to create one big file. How? Well, let’s use cat.
cat *.txt > composed.txt
By running that command in the command line, I successfully created a single file containing the information from all the other files. You should know that I had all my files in a directory that contained nothing but those files.
Now that I had a big file containing sets of data, each on its own line, it was easy to go from the one-line format to the format I wanted. It is a matter of creating a macro that converts one line into the desired format, and then running the macro as many times as there are lines in the file. At this point the file was close to 100,000 lines long.
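If you would rather not guess at a repeat count, vim can also apply a macro to every line for you. A sketch, assuming the line-conversion macro was recorded into register f (the register name is my invention, not from the original run):

```vim
:%normal @f
```

This runs the macro once on each line of the buffer. It suits a per-line conversion like this one; it would not suit the earlier join macro, since a macro that adds or removes lines throws the range off.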
I hope this shows you the power of macros and vim. But what does this have to do with awk? Well, awk was an option I looked at when I first started dealing with the problem. In fact, awk processes files line by line, which is the reason I wanted to put each set of information on its own line in the first place.
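For what it is worth, the line join itself could also have been done in awk, since it reads input line by line. A sketch, using the same “miles)” pattern as before: print each line without its newline, and emit a real newline only when the line ends a record.

```shell
# Join wrapped records: replace each line break with a space, except
# after lines that end with "miles)", which close a record.
printf '%s\n' \
  'Bay Area Frame 2218 Market Street San Pablo, CA 94806' \
  'Directions (~13.00 miles)' |
  awk '{ printf "%s%s", $0, (/miles\)$/ ? "\n" : " ") }'
# -> Bay Area Frame 2218 Market Street San Pablo, CA 94806 Directions (~13.00 miles)
```

Unlike the macro approach, this handles records broken across any number of lines in a single pass.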
I am not an awk expert, but I think there is not much that can be done with awk that cannot be done with vim. For example, I had another set of files. Each file started with “Array”, then had thousands of lines, and ended with “)”. It was something like this:
Array
(
 => Array
  (
   [name] => name
   *
   [address] => address city state, zip
   [business_text] => may or may not have text here
   [phone] => (XXX) XXX-XXXX
  )
 => Array
  (
   [name] => name
   *
   [address] => address city state, zip
   [business_text] => may or may not have text here
   [phone] => (XXX) XXX-XXXX
  )
 => Array
  (
   [name] => name
   [address] => address city state, zip
   [business_text] => may or may not have text here
   [phone] => (XXX) XXX-XXXX
  )
 ... Add a lot more sets like the previous ones
)
Again, if you have many files you can concatenate them with cat, and then work on a single file. I wanted to get rid of the Array and ( lines, the empty lines, and the lines that contained only an *. You can do that in awk, but I decided to do it in vim.
First I noticed a pattern: all lines that have Array in them are followed by a line that only has (. Based on that, you know that if you find a line that contains Array, you can delete it and the following line too.
My first attempt was with a macro. First, search for Array$ to find all the lines that ended with Array. Then start recording a macro: delete two lines (2dd), press n to jump to the next match, and end the macro. Repeat the macro as many times as needed.
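The recording might look like this (again a sketch; register d is my choice, not from the original post):

```vim
/Array$    search (then press Enter) to land on the first matching line
qd         start recording into register d
2dd        delete the Array line and the ( line below it
n          jump to the next line ending in Array
q          stop recording
```

Then replay it in bulk with something like 100000@d. Conveniently, since each run deletes the match it sits on, n eventually finds nothing, fails, and the remaining repetitions are cancelled on their own.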
At some point, while the macro was off on a 100,000-repetition run, it occurred to me that vim had to have a better way to do what I wanted. A quick search revealed the command I was looking for: :global, or :g for short. Armed with that new knowledge, deleting all the lines that ended with a ) took a single command, and the same trick took care of the empty lines and the lines that had only an asterisk.
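The exact commands from the original run are not shown here, but given the descriptions above they were presumably :g deletions of this form (the patterns are my guesses):

```vim
" delete every line that ends with a )
:g/)$/d
" delete every empty line
:g/^$/d
" delete every line that contains only an asterisk
:g/^\*$/d
```

And for that matter, :g/Array$/.,+1d would have replaced the 2dd macro entirely, deleting each matching line together with the ( line below it in one pass.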
And that was all. I was done with my file editing. How would I have done it with awk? Who cares, this time I got vim!
Not used this time, but good to know: