Some More Vim Magic

Let's delay the next post about web scraping a bit longer to talk about vim again.

Imagine you have a file filled with thousands of sets of data, all of them looking like this one:

Some text here
more text
text text
ST 12345
more text
more text

The first line is the start of the set, and the last one is the end of the set, which is then followed by another set. You will notice that one line in the set is a state and zip code combo. Imagine that you want to put the state and the zip code each on its own line. If you only had a few sets you might just do it manually, or record a macro. But if you have over 100K records, neither of those options sounds good anymore. What you need is a global action (:g).

The idea is simple: find all the lines that follow the same pattern. The pattern is: the line starts with 2 capital letters followed by a space, which is followed by 5 digits. The search pattern is easy:


/^[A-Z]\{2}\s\d\{5}

And performing a global action is also easy, for example, if you wanted to delete those lines you would do this:


:g/^[A-Z]\{2}\s\d\{5}/d

But we want to do something a bit more complex. We want to find the space in that match, and replace it with a <CR> (carriage return) so that we end up with the state and the zip each on its own line. My first attempt was this:


:g/^[A-Z]\{2}\s\d\{5}/normal ^f r<CR>

But that doesn’t work. You end up with lines looking like this:

CAR>

because :normal reads <CR> as five literal keystrokes. We find the space, r< replaces it with a <, then C deletes from there all the way to the end of the line and enters insert mode, where “R>” gets inserted.

What about this:


:g/^[A-Z]\{2}\s\d\{5}/normal ^f r\<CR>

It doesn’t work either.

I decided to take a look at the :normal help in vim. There I found out that if you want to use printable characters to represent non-printable ones, you need to use :exec. So I came up with this one:


:g/^[A-Z]\{2}\s\d\{5}/exec "normal ^f r\<CR>"

That one worked.
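
To make the transformation concrete, here is a rough Python sketch of the same idea. It is only an illustration and has nothing to do with how vim does it internally: the function name split_state_zip is made up for this example, and the regular expression is my translation of the vim pattern above.

```python
import re

# Translation of the vim pattern ^[A-Z]\{2}\s\d\{5}: two capital letters,
# one whitespace character, five digits, anchored at the start of the line.
STATE_ZIP = re.compile(r"^([A-Z]{2})\s(\d{5})")


def split_state_zip(lines):
    """Return a new list where each 'ST 12345' line becomes two lines."""
    result = []
    for line in lines:
        match = STATE_ZIP.match(line)
        if match:
            # Like r<CR> on the first space: the state stays on one line and
            # the rest of the line (the zip code onwards) moves to the next.
            result.append(match.group(1))
            result.append(line[3:])
        else:
            result.append(line)
    return result


if __name__ == "__main__":
    record = ["Some text here", "more text", "ST 12345", "more text"]
    print("\n".join(split_state_zip(record)))
```

Lines that do not match the pattern pass through untouched, which is the same guarantee :g gives you.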

Once more I got to experience the power of vim, and I have one more reason to not go back to my old editor. Since I went vim a couple of years ago, I’ve never looked back, and so far, I think I never will.

Who Needs Awk? I Got Vim

I wanted to post a new entry following on the topic of scraping, but I am just too eager to share the power of vim with you. Besides, what you learn in this post may be helpful to you in your own scraping adventures.

Scraping the web is fun, but sometimes we end up with data that is not formatted in any way that we can use. If you don’t plan ahead, you may end up with files containing data that needs further processing before you can do anything useful with it. I was in such a position after my very first scraping adventure. Think of a file that has data like this:

Triple T Autobody & Paint 74-H Hamilton Drive Novato,
CA 94949 (415) 883-2041 Directions (~9.55 miles)
Bay Area Frame 2218 Market Street San Pablo, CA 94806 (510)
233-1448 www.bafautobody.com Directions (~13.00 miles)
Stewart's Body Shop 12540 San Pablo Avenue Richmond,
CA 94805 (510) 235-3515 Directions (~14.31 miles)
Bavarian Professionals 1218 7th Street Berkeley, CA
94710 (510) 524-6000 www.bavpros.com Directions (~16.58 miles)

As you can see, we have information about body shops, but that information is broken into two lines for each body shop, and not always in the same place. I noticed that the lines all broke at around the same character count, although I’m not sure why. What I wanted to do was to arrange the data like this:

Shop Name
Address
State
Zip
Phone [website]

But there is no easy way to do it since the lines are broken, and not in the same place. I knew the first step would be to put each set of information on its own line.

It is worth noting here that my first two files had a different format. For some reason they were formatted something like

Shop Name
Address State Zip
Phone
Directions (xx.xx miles)

Again, I’m not sure why that was the case. The fact that the method I used to convert the first format into single-line sets of information also fit this second format was a happy accident.

So we want to put every set of information on its own line, but how? Well, the first thing you need to do is to search for a pattern that you can use. In this case I noticed that all the information sets ended with (~xx.xx miles). That was all I knew.

I was dealing with 5 files, each of which contained about 15,000 sets of information. That makes two things clear:
1) Manual work will not be an option here.
2) I should assume that somewhere in those files there is a set that is broken into more than two lines.

Things are not as simple as they seemed at first, but at least I know that there is a pattern. I took advantage of that pattern, and created a simple macro in vim.

The macro instructions, saved in register a, were simple. I started at the first line: place a mark at the beginning of the line, search for the next line that ends in “miles)”, enter visual mode with “V”, go to the mark previously set, hit “J” to join the lines, and move on to the next line.

That was all I had to do, then run the macro as many times as needed, usually starting with 10,000 runs (10000@a, with no space between the count and @a) and then fewer each time. Repeat for every file.
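
If you prefer to see the logic spelled out, this is roughly what the macro does, written as a small Python sketch. It is not what I ran; the function name join_records and the decision to skip blank lines are assumptions made for the example.

```python
def join_records(lines, end_marker="miles)"):
    """Join consecutive lines so that every record sits on a single line.

    A record is considered finished at the first line that ends with
    end_marker, no matter how many lines it was broken into.
    """
    records = []
    pending = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # assumption: blank lines carry no data, so skip them
        pending.append(text)
        if text.endswith(end_marker):
            # Same effect as vim's J: the pieces are joined with one space.
            records.append(" ".join(pending))
            pending = []
    if pending:
        # Trailing lines without an end marker are kept, not thrown away.
        records.append(" ".join(pending))
    return records


if __name__ == "__main__":
    sample = [
        "Triple T Autobody & Paint 74-H Hamilton Drive Novato,",
        "CA 94949 (415) 883-2041 Directions (~9.55 miles)",
    ]
    for record in join_records(sample):
        print(record)
```

Because a record only ends when a line finishes with “miles)”, sets broken into three or more lines, and the four-line format of my first two files, are handled the same way as the two-line ones.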

Now that I had all my files with every set of information on its own line, it was time to create one big file. How? Well, let's use cat.


cat *.txt > composed.txt

By running that command in the command line, I successfully created a single file containing the information of all the other files. You should know that I had all my files in a directory where there was nothing else but those files.

Now that I have a big file containing sets of data each on its own line, it is easy to go from one line format to the format that I wanted. It is a matter of creating a macro that converts one line into the format I wanted, and then running the macro as many times as there are lines in the file. At this point the file was close to 100K lines long.
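
I did this step with a vim macro, but the idea can also be sketched in Python. Treat the following as a loose illustration under several assumptions: the names RECORD and format_record are made up, the regular expression is guessed from the four sample records shown earlier, and the shop name and street address are kept together on one line because the samples have no reliable separator between them.

```python
import re

# Guessed from the sample records: "<name and address>, ST 12345 (XXX)
# XXX-XXXX [website] Directions (~N.NN miles)". The website is optional.
RECORD = re.compile(
    r"^(?P<name_address>.+?),?\s+"
    r"(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})\s+"
    r"(?P<phone>\(\d{3}\)\s*\d{3}-\d{4})"
    r"(?:\s+(?P<website>\S+))?"
    r"\s+Directions\s+\(~?[\d.]+ miles\)$"
)


def format_record(line):
    """Split a one-line record into output lines, or return None."""
    match = RECORD.match(line.strip())
    if match is None:
        return None  # leave odd records for manual inspection
    last_line = match.group("phone")
    if match.group("website"):
        last_line += " " + match.group("website")
    return [
        match.group("name_address"),  # name and address, not separated
        match.group("state"),
        match.group("zip"),
        last_line,
    ]


if __name__ == "__main__":
    line = ("Bay Area Frame 2218 Market Street San Pablo, CA 94806 "
            "(510) 233-1448 www.bafautobody.com Directions (~13.00 miles)")
    print("\n".join(format_record(line)))
```

Records that do not fit the guessed pattern come back as None, so they can be collected and fixed by hand instead of being silently mangled.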

I hope this shows you the power of macros and vim. But what does this have to do with awk? Well, awk was an option I looked at when I first started dealing with the problem. In fact, awk processes files line by line, which is the reason I wanted to put each set of information on its own line in the first place.

I am not an awk expert, but I think there is not much that can be done with awk that cannot be done with vim. For example, I had another set of files. Each file started with “Array” and then it had thousands of lines, and ended with “)”. It was something like this:

Array
(
   [0] => Array
      (
      [name] => name
*
      [address] => address
city
state, zip

      [business_text] => may or may not have text here
      [phone] => (XXX) XXX-XXXX
   )

   [1] => Array
      (
      [name] => name
*
      [address] => address
city
state, zip

      [business_text] => may or may not have text here
      [phone] => (XXX) XXX-XXXX
   )

   [2] => Array
      (
      [name] => name
      [address] => address
city
state, zip

      [business_text] => may or may not have text here
      [phone] => (XXX) XXX-XXXX
   )
   ... Add a lot more sets like the previous ones
)

Again, if you have many files you can concatenate them with cat, and then work on a single file. I wanted to get rid of the Array( lines, empty lines, and the lines that contained only an *. You can do that in awk, but I decided to do it in vim.

First I noticed a pattern: all lines that have Array in them are followed by a line that only has (. Based on that, you know that if you find a line that contains Array, you can delete it and the following line too.

My first attempt was with a macro. First do a search for Array$, this would find all the lines that ended with Array. Then start a macro, delete 2 lines (2dd), press n to go to the next pattern match, and end the macro. Repeat the macro as many times as needed.

At some point, while the macro was on a 100,000 time run, it occurred to me that vim had to have a better way to do what I wanted. A quick search revealed the command that I was looking for: the global command (:g). Now it was time to delete all lines that ended with a ), along with the line following each one. I was armed with new knowledge, so a simple

:g/)$/normal 2dd

did the trick. And finally, to delete all the empty lines and the lines that had only an asterisk, I did these two commands:

:g/^\*$/d
:g/^$/d

And that was all. I was done with my file editing. How would I have done it with awk? Who cares, this time I got vim!
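
For the curious, the whole cleanup can be sketched in a few lines of Python. This is not what I ran, and it is a bit stricter than the vim commands: it only drops the line after an Array line when that line is a lone (, and it only drops ) lines that contain nothing else. The function name clean_dump is made up for the example.

```python
def clean_dump(lines):
    """Drop the structural noise from a dump like the one shown above."""
    cleaned = []
    expect_paren = False
    for line in lines:
        text = line.strip()
        if expect_paren:
            expect_paren = False
            if text == "(":
                continue  # the opening paren that follows an Array line
        if text.endswith("Array"):
            expect_paren = True  # the next line should be the lone "("
            continue
        if text in ("", "*", ")"):
            continue  # empty lines, lone asterisks, closing parens
        cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    dump = [
        "Array",
        "(",
        "   [0] => Array",
        "      (",
        "      [name] => name",
        "*",
        "      [phone] => (XXX) XXX-XXXX",
        "   )",
        "",
        ")",
    ]
    print("\n".join(clean_dump(dump)))
```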

Some links for reference:
http://stackoverflow.com/questions/16223054/how-to-delete-all-lines-matching-a-pattern-and-a-line-after-in-vim
https://duckduckgo.com/?q=remove+all+lines+matching+patter+vim

Not used this time, but good to know:
https://duckduckgo.com/?q=delete+all+pattert+matches+vim
http://stackoverflow.com/questions/7842333/delete-matching-search-pattern-in-vim

Starting with Git

I had postponed using some kind of VCS for a long time. In the past (way back) I looked at SVN, but did not quite like the fact that it uses a server, maybe because back then I did not understand how exactly it worked. For some reason, I thought I would have to move all my projects somewhere else. It is a stupid way of thinking about it, but I was young and a bit more naive than now. Not long ago I looked at git. I installed it, and even created a repository for my core project, but then I felt uneasy using it without really understanding what I was doing, so I deleted the .git folder from core, and forgot about the whole thing.

A couple of weeks ago, I received a quote request, and one of the project requirements was that it be source controlled using git. I honestly informed the client that I had no experience with git, and that I might need help getting started. The client agreed, and the project was on. This week we finally got the go-ahead for the project, and I immediately had problems.

First, it seems their server was not configured properly, but then when it all worked, I did not know what to do. I was supposed to clone the project that they had initialized, which I did, but then got nothing but an empty directory. Then, after doing git pull all I got was an index.php file that said replace me. I was lost.

Fortunately, one of the guys in the team was very helpful explaining what was going on, but more importantly, he provided me with some good resources (http://sixrevisions.com/resources/git-tutorials-beginners/) that I could read. That helped me a lot, and I’ve been reading a few articles and resources about git. If you are new to git, I recommend the next few reads:

http://net.tutsplus.com/tutorials/other/easy-version-control-with-git/
http://www.ralfebert.de/blog/tools/git_screencast/
http://tom.preston-werner.com/2009/05/19/the-git-parable.html
http://www.webdesignerdepot.com/2009/03/intro-to-git-for-web-designers/

And here is a cheat sheet.

Once you’ve understood the basics of git, you can continue with these reads about git submodules, and why you might not want to use them, as well as some alternatives.
http://git-scm.com/book/ch6-6.html
http://stackoverflow.com/questions/3456888/git-repository-inside-another-git-repository
http://web.archive.org/web/20090302072040/http://www.kernel.org/pub/software/scm/git/docs/git-submodule.html
http://codingkilledthecat.wordpress.com/2012/04/28/why-your-company-shouldnt-use-git-submodules/
http://ayende.com/blog/4746/the-problem-with-git-submodules
http://blog.appfog.com/getting-lean-and-solving-problems-farewell-to-git-submodules/
http://stackoverflow.com/questions/2669477/sharing-code-between-two-or-more-rails-apps-alternatives-to-git-submodules
http://www.rubyinside.com/giternal-easy-git-external-dependency-management-1322.html
http://psionides.eu/2010/02/04/sharing-code-between-projects-with-git-subtree/

Finally, this explains how to get rid of a submodule, in case you want to try them and later decide that you don’t really like them.
http://stackoverflow.com/questions/1260748/how-do-i-remove-a-git-submodule

I asked a question on stackoverflow, where someone said I should not start with git as my first VCS. I think this answer is much like when people tell you “don’t use VIM. It is very complicated”. The truth is that, while VIM is more complicated than a simple text editor, and Git is more confusing than most VCS, there are reasons why you should try them. I won’t get into the case of VIM, because that is out of the scope of this post, but in the case of Git, I’ve found a few things I like:

* Everything is local. There is no need to be online, or connected to the server in order to work. You could work offline, and then just push your changes once you can get an internet connection.
* You don’t even need a server. If you are developing on your own, you don’t need a server of any kind, since everything is local.
* Almost everything is kept in .git. When you initialize a repository, most of the changes, settings, and whatnot related to git are kept in a directory called .git in the top directory of the repository.
* Once you clone a project, you get the full history, and again, it is all local.
* It is easy to use. Once you understand how it works, it is actually easy to use. I recommend you read the git parable that is linked above. It will really get you to understand git very quickly.

There are of course things that get complicated, like submodules, but just because something is complicated does not mean you should stay away from it.

Yes, I am a beginner. Yes, there are things I don’t know. Yes, I will make mistakes that will make me look stupid. Yes, the person who told me not to use git as my first VCS knows more than me about it and probably had a legitimate reason to say that. However, there were people saying you should not use VIM, but I did anyway, and to this day I haven’t looked back.

There are times when someone, maybe even me, will try to discourage you from using something or other, but you should always give it a try. Just because I find something hard does not mean you will too. And even if you do, what is the fun in doing just the easy stuff?

Links of the day 12/11/2012

I have been just sharing links these last few posts, and sadly, today is no exception. I am aware there are a few posts that need to be written, and I need to continue that series I started about building AIR Apps, but time has just not been too kind to me lately. Anyway, I hope these links can keep you busy.

Javascript:
http://davidwalsh.name/documentfragment – Just a quick intro to document fragment.
http://davidwalsh.name/deferred – Javascript defer for a cleaner code.
http://www.sitepoint.com/get-started-with-three-js/ – Getting started with three.js
http://blog.millermedeiros.com/stop-writing-plugins-start-writing-components/ – Plugins VS Components. Interesting.
http://blog.millermedeiros.com/namespaces-are-old-school/ – Namespaces are old school, use modules.
http://blog.millermedeiros.com/amd-is-better-for-the-web-than-commonjs-modules/ – A look into why AMD is better than commonjs modules. Is it?

VIM:
blog.millermedeiros.com/tag/vim/ – Improved VIM status bar. Nice!

HTML 5
http://davidwalsh.name/phone-link-protocol – The phone link protocol.
http://davidwalsh.name/vibration-api – The vibration API. Let's hope it does not get abused.

WordPress:
http://davidwalsh.name/ssl-wordpress – Quick and easy way to force SSL on wordpress sites.
http://dzineblog.com/2012/12/best-practices-for-keeping-wordpress-clean-secure.html – WordPress security best practices.

Interior Design:
http://www.home-designing.com/2012/12/toblerone-house-brazil
http://www.home-designing.com/2012/12/super-small-space-living-inspiration-ikea

Ubuntu
http://www.atareao.es/ubuntu/conociendo-ubuntu/quieres-aprender-a-crear-paquetes-para-ubuntu/ – How to create Ubuntu packages. (Spanish)

Design:
http://dzineblog.com/2012/11/33-new-freebie-buttons-and-icon-sets-released-in-autumn-2012.html – Free icon sets.

Ruby:
http://ruby-python.com.ar/ruby/ – A ruby tutorial that I have not yet followed, but I share in case you are interested. (Spanish)

Other:
http://davidwalsh.name/twitter-cards – Nice explanation on how to create twitter cards.
http://creativefan.com/black-and-white-backgrounds/ – Black and white backgrounds. Mostly pictures, and some of them are not really B&W, but still interesting.
http://alt1040.com/2012/12/twitter-rastrea-webs – Twitter keeps track of the sites you visit. And how to stop it. (Spanish)
http://build-podcast.com/ – A podcast about development tools.
http://www.mightydeals.com/ – A website that seems to offer good deals on resources for web professionals.

Enjoy the readings, and don’t hate me for posting nothing but links these last few rounds.

Do it Faster With VIM!

This is just a quick-n-short entry.

Every time I do something fast, I get a bit of an adrenaline rush. This time I was able to save a ton of time by using vim. I am currently working on a little project for a student back in Mexico. She wanted to build a didactic game so learning would be more fun, so she sent me a few specifications for the game, and a list of questions and answers. Every question is formulated like this:

N.-) Some Question?

a) Ans. 1

b) Ans. 2

c) Ans. 3

There are three types of questions for 3 different types of subjects. The game is a simple racing game where a car encounters little markers along the road and for every marker a question pops up. Previously the player has selected which of the three types of questions they want. Based on that I decided to create an array to manage the questions and answers. The array would be like this:


array(
   type_1 = array(
      ...
   ),
   type_2 = array(
      ...
   ),
   type_3 = array(
      ...
   )
)

As you can see, I created an array for each type of question. Each of these arrays contains another two arrays, one for the questions and one for the answers, so the type arrays look like this:


type_N = array(
   questions = array(
      ...
   ),
   answers = array(
      ...
   )
)

This way it is easy to relate the questions to their answers, since the question in index 0 has its answers in index 0 as well.

Since there are over 60 questions, creating the arrays by hand would take too much time, and in this project time is really, really scarce. I decided to copy and paste all the questions over to VIM. Luckily, the student is very organized and followed the same pattern for all the questions. So it was a matter of recording one macro that converts this:

N.-) Some Question?

a) Ans. 1

b) Ans. 2

c) Ans. 3

Into this:

questions['type_1']['questions'].push('Some Question?');
questions['type_1']['answers'].push('Ans. 1');
questions['type_1']['answers'].push('Ans. 2');
questions['type_1']['answers'].push('Ans. 3');

Then run this with a simple NN@a (NN was the number of times I had to run the macro). For questions type_2, and type_3, I just needed to change the index value from type_1 to type_2, or type_3.
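
The macro itself is hard to show in a post, so here is a rough Python sketch of the same conversion. It is an illustration only: the function name to_push_statements is made up, and I am assuming that N in “N.-)” is always a number and that the answers are always labeled a, b, or c, as in the sample above.

```python
import re

QUESTION = re.compile(r"^\d+\.-\)\s*(.+)$")  # e.g. "12.-) Some Question?"
ANSWER = re.compile(r"^[a-c]\)\s*(.+)$")     # e.g. "a) Ans. 1"


def to_push_statements(lines, question_type="type_1"):
    """Turn the question/answer text into one push(...) line per entry."""
    statements = []
    for line in lines:
        text = line.strip()
        question = QUESTION.match(text)
        answer = ANSWER.match(text)
        if question:
            key, value = "questions", question.group(1)
        elif answer:
            key, value = "answers", answer.group(1)
        else:
            continue  # blank lines and anything unrecognised are skipped
        # Escape backslashes and single quotes so the generated string
        # literal stays valid.
        value = value.replace("\\", "\\\\").replace("'", "\\'")
        statements.append(
            "questions['{}']['{}'].push('{}');".format(question_type, key, value)
        )
    return statements


if __name__ == "__main__":
    sample = ["1.-) Some Question?", "", "a) Ans. 1", "", "b) Ans. 2", "", "c) Ans. 3"]
    print("\n".join(to_push_statements(sample)))
```

Passing "type_2" or "type_3" as the second argument plays the same role as changing the index value by hand for the other question types.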

I was done in less than 5 minutes. And some people think VIM is stupid.

Links of the Day (Vim Edition) – 11/27/2012

After being obsessed about trying to find a vim talk, which I have not yet found, I ended up with a bunch of tabs open on my already cluttered firefox windows (that’s right, with an s). I decided to make the links of the day post for today exclusively about vim. Most of the links here were found via Hacker News, duckduckgo, or by following links on the articles found via the previous two methods.

Since all of these links point to articles about vim, I will not explain them, just leave one link per line. Some articles are merely interesting, while others are really useful. Another thing I recommend, and which I’ve been doing, is reading the help pages in vim. Next time you open vim, type :help, and take a moment to read one of the pages there. Read one page every time you open vim, and you will progressively increase your vim knowledge.

Here are the links:

http://yannesposito.com/Scratch/en/blog/Learn-Vim-Progressively/
http://yehudakatz.com/2010/07/29/everyone-who-tried-to-convince-me-to-use-vim-was-wrong/
http://www.reddit.com/r/vim/
http://vimcasts.org/episodes/tabs-and-spaces/
http://vimcasts.org/episodes/soft-wrapping-text/
http://stevelosh.com/blog/2010/09/coming-home-to-vim/
http://www.catonmat.net/blog/why-vim-uses-hjkl-as-arrow-keys/
http://blog.sanctum.geek.nz/advanced-vim-registers/
http://robots.thoughtbot.com/post/13164810557/the-vim-learning-curve-is-a-myth
http://vimdoc.sourceforge.net/htmldoc/recover.html
http://winterdom.com/2009/02/vimswapfiles
http://my.opera.com/peterchenadded/blog/2008/12/27/gvim-7-1-swap-and-backup-files
http://www.sanjib.org/post/vim-backups-can-be-a-security-risk-for-php-configuration-files
http://vim.wikia.com/wiki/Remove_swap_and_backup_files_from_your_working_directory
http://superuser.com/questions/188815/how-to-hide-gvims-backup-files-under-windows-xp
https://addons.mozilla.org/en-US/firefox/addon/vimfx/
http://blog.carbonfive.com/2011/10/17/vim-text-objects-the-definitive-guide/?
https://github.com/bpowell/vim-android
http://blog.sanctum.geek.nz/vim-misconceptions/
http://www.derekwyatt.org/vim/vim-tutorial-videos/vim-novice-tutorial-videos/#
http://robots.thoughtbot.com/post/27041742805/vim-you-complete-me
http://amix.dk/blog/post/19083#10-kick-ass-Vim-tips
http://vimcheatsheet.com/
http://www.productionhacks.com/2012/05/06/my-third-attempt-at-vim/

These are quite a few links, so why don’t you grab a cup of tea and have a long read?