Useful script i made some months ago

I am from Venezuela and some months ago we had presidential elections.

The opposition, which is in someway my tendency (I am tired of this goverment), called a fraud. So in order to check that my self i made a crawler for the web page (www.cne.gob.ve/resultado_presidencial_2013/r/1/reg_000000.html) with the results in order to gather and analyze them.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import re
import threading

paths_re = re.compile('<s*li\s+class="region-nav-item"><a\s+id="region_ref"\s+href="(.*)"\s*>(.*)<\s*/a\s*>\s*<\s*/li\s*>', re.IGNORECASE)
values_re = re.compile('<tr\s+class="tbsubtotalrow"\s*>\s*<td\s+class="lightRowContent"\s+align="center"\s*>\s*<img\s+src=".+"\s+alt=""\s+width="50"\s+/>\s*
\s*<td\s+class="lightRowContent">\s*<table\s+width="100%">\s*<\s*tr\s*>\s*<\s*td\s+align="left"\s*>\s*<\s*a\s+href="javascript:showCandidateInfo\(\d\);"\s*>(.*)<\s*/a\s*>\s*<\s*/td\s*>\s*<\s*/tr\s*>(\s*<\s*tr\s*>\s*<\s*td\s*>\s*<\s*font\s+color="\#990000"\s*>\s*Adjudicado\s*<\s*/font\s*>\s*<\s*/td\s*>\s*<\s*/tr\s*>\s*|\s*)<\s*/table\s*>\s*<\s*/td\s*>\s*<\s*td\s+class="lightRowContent"\s+align="right"\s*>\s*<\s*span\s*>(.*)<\s*/span\s*>\s*<\s*/td>\s*<\s*td\s+class="lightRowContent"\s+align="right"\s*>\s*<\s*span\s*>(.*)<\s*/span\s*>\s*<\s*/td>\s*', re.IGNORECASE)
filter_re = re.compile('%|\.|\s')
main_url = 'http://www.cne.gob.ve/resultado_presidencial_2013/r/1/'
depth_map = ['All', 'State', 'Municipality', 'County', 'Center', 'Table']
separator = '\t'
outfiles = []
outfilename = 'electionsdata'
qttythread = 6
rthreads = []
generalfile = open(outfilename, 'w')
firstorder = dict()

def crawl_page(url, region, depth, file_to_write):
    page = requests.get(url)
    paths = paths_re.findall(page.text)
    values = values_re.findall(page.text)

    if region == separator:
        names = map(lambda v: v[0], values)
        counter = 0
        for name in names:
	        firstorder[name] = counter
	        counter += 1
		firstorder = map(lambda v: v[0], values)
        print_headers(file_to_write, names)

    print_info(file_to_write, region, values, depth)

    if region == separator:
        chunked_lists = chunk_list(paths, (len(paths) + (qttythread - len(paths) % qttythread)) / qttythread)
        threadn = 0
        for chunked_list in chunked_lists:
            t = threading.Thread(target=recursive_calls, args=(chunked_list, region, depth, outfiles[threadn], ))
            t.start()
            rthreads.append(t)
            threadn += 1
    else:
        recursive_calls(paths, region, depth, file_to_write)

def recursive_calls(paths, region, depth, file_to_write):
    savedregion = region
    for path in paths:
        pathsurl = path[0]
        pathsfor = path[1]
        region =  savedregion + separator + pathsfor
        print region
        crawl_page(main_url + pathsurl, region, depth + 1, file_to_write)

def chunk_list(l, chunksize):
    result = []
    for i in xrange(0, len(l), chunksize):
        result.append(l[i:i+chunksize])
    return result

def print_headers(file_to_write, names):
    result = 'Type'
    for region_type in depth_map:
        result += separator + region_type
    for name in names:
        result += separator + name + ' value\t' + name + ' %'
    file_to_write.write(result + '\n')

def print_info(file_to_write, region, values, depth):
	order_values(values)
    result = depth_map[depth] + region
    for i in xrange(len(depth_map) - depth - 1):
        result += separator
    for value in values:
        result += separator + clean_number_srt(value[2]) + separator + clean_number_srt(value[3])
    file_to_write.write(result + '\n')

def order_values(values):
    position = 0
    while position < len(values):
        needed_position = firstorder[values[position][0]]
	    if position == needed_position:
            position += 1
	    else:
            swap(values, position, needed_position)

def swap(l, i, j):
	aux = l[i]
	l[i] = l[j]
	l[j] = aux

def clean_number_srt(number_str):
    return filter_re.sub('', number_str).replace(',', '.')

if __name__ == "__main__":
    for i in xrange(qttythread):
        outfile = open(outfilename + str(i), 'w')
        outfiles.append(outfile)

    crawl_page(main_url + 'reg_000000.html', separator, 0, generalfile)

    for rthread in rthreads:
        rthread.join()

    for outfile in outfiles:
        outfile.close()
    generalfile.close()

In the end, i didn’t run it. I think this final version is pretty complete and can be used for another variety of things, thus I am sharing the code :). It use a number of threads and check the pages in a recursive way writing to different files. The regular expressions are pretty ugly, but i didn’t want to use a HTML parser or something like that.

Advertisements

Setting up Tomcat 7 with IntelliJ IDEA

I have been doing a java project with IntelliJ IDEA. It is a quite normal IDE, with some supposed advantages over Eclipse. I got used to it in two weeks. 

So i’ll to explain how to run a project with Tomcat 7.

  1. Download the core version of Tomcat 7 here. DO NOT INSTALL IT WITH APT-GET OR ANY PACKAGE MANAGER
  2. Uncompress the folder anywhere you want
  3. Click on Run -> Edit Configurations…
    Search on the left for Tomcat Server and select Local

  4. Click Configure

  5. In the name put Tomcat 7
    Tomcat Home and Tomcat base directory are the folder you created in step 2

  6. Click Ok
    Set Startup page: http://localhost:8081/, HTTP port: 8081
    You can change this values to any you want

  7. Click on Run -> Edit Configurations…
    Click on the + sign 

  8. Select Tomcat Server Local
    Go to Deployment tab

  9. Click on the + sign, select artifact you want to add to the server (in the picture i added one called dsp-api.war)
    Click Ok

  10. That’s all run and test your application 🙂

Always learning !

As i wrote before, i did participate on Google Code Jam Qualification Round. I just did 3 problems and got 50 points.I’ll talk about one of the problems: Lawnmower.

#!/usr/bin/python
# -*- coding: utf-8 -*-

from sys import stdin

if __name__ == "__main__":
 ncases = int(stdin.readline())

 for ncase in xrange(ncases):
     n, m = map(int, stdin.readline().split(' '))

     lawn = [[] for j in xrange(n)]

     for row in xrange(n):
         lawn[row] = map(int, stdin.readline().split(' '))

         allle = True

         for row in xrange(n):
             for column in xrange(m):

                 hle = True
                 vle = True
             for frow in xrange(n):
                 if lawn[frow][column] > lawn[row][column]:
                     vle = False
             for fcolumn in xrange(m):
                 if lawn[row][fcolumn] > lawn[row][column]:
                     hle = False

             allle = hle or vle

             if not allle:
                 break
         if not allle:
             break

     result = 'YES' if allle else 'NO'
     print 'Case #' + str(ncase + 1) + ': ' + result

I got stuck for sometime, but in the end i found and coded quite elegant solution in python. We have a matrix and every element e[i,j] on it had to satisfy the following constraints:

  1. All elements e[i,k], with k != j must be e[i,k] <= e [i, j]
  2. All elements e[k,j], with k !=i must be e[k,j] <= e [i, j]

If both constraints are satisfied, then the answer is yes, else we can’t cut the lawn.

The interesting part is the next round i’ll like to talk about. I didn’t participate on Round A, but i did try round B. As i told before, i never practice for this, and after trying quite a bit the first problem i got frustrated and decided to stop. In like 30 minutes, there were already a lot of people with 2 or more problems completed.

I want to talk about the first problem: Osmos. The problem was easy to understand, my mistake was getting so stuck in thinking “efficiently” and trying to find the best solution. In the end i made a bad assumption and never passed the runs. I stopped and reminded to view the solution later.

I viewed the solution found my error and learned a lot from the code i saw.

I’ll show my fixed code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from sys import stdin

def dp(arminmote, motes):
    if len(motes) == 1:
        return 1 if arminmote <= motes[0] else 0 
    elif arminmote > motes[0]:
        steps = dp(arminmote + motes[0], motes[1:])
    else:
         steps = 1 + min(dp(2*arminmote - 1, motes), dp(arminmote, motes[1:]))
    return steps

if __name__ == "__main__":
    ncases = int(stdin.readline())

    for ncase in xrange(ncases):
        case = stdin.readline()
        arminmote, leftmotes = map(int, case.split(' '))

        case = stdin.readline()
        motes = map(int, case.split(' '))
        motes.sort()

        if arminmote > 1:
            steps = dp(arminmote, motes)
        else:
            steps = len(motes)

        print 'Case #' + str(ncase + 1) + ': ' + str(steps)

The Round B winner code:

# dolphinigle
# GCJ 2013 1B
# 4 May 2013

ntest = int(raw_input())
for itc in range(ntest):
  print 'Case #%d:' % (itc+1),
  my_mote, n = map(int, raw_input().split())
  motes = sorted(map(int, raw_input().split()))
  if my_mote == 1:
    print n
    continue
  best_answer = n
  my_answer = 0
  for index, i in enumerate(motes):
    if i < my_mote:
      my_mote += i
      continue
    # maybe stop here?
    best_answer = min([best_answer, my_answer + n - index])
    while my_mote <= i:
      my_mote += my_mote - 1
      my_answer += 1
    my_mote += i
  best_answer = min([best_answer, my_answer])
  print best_answer

Quite the difference? Yes. First i didn’t see any polynomial way of solving, i just though that it was needed in some cases to choose between 2 possibilities and didn’t use any memorization or dynamic programming technique. I did fail quite hard, but it is important to learn from mistakes.

I saw why the first place on round B is this guys. In just 7 minutes he came with this brilliant solution. With one cycle and using two variables: one to count the motes needed to eat and the other the best solution so far, starting with removing all the motes. I was amazed and learned a lot seeing this code. I had never use raw_input and enumerate. They are so useful.

Learning and sharing is so important, specially all regarding computers. One of the reason i made this post is that. Hope it is useful to someone.

Some programming problems

I am not a big fan of programming competions, but i do love good challenges and interesting problems. Google Code Jam (https://code.google.com/codejam/) is comming soon and i decide to participate in a normal way. I am not that hardcore programmer 24/7.

For practice I did two problems: File Fix-it and All Your Base. As i was so lazy, i used Python. I want to comment my reasoning for solving each of the problems.

  • File Fix-It: The problem consisted on on telling the minimum number of mkdirs (it creates folders one by one) to create the given directories. In each case was given the already done directories and the ones we needed to create. Creating directories can be viewed as building a tree. I simulated that using a dictonary.
#!/usr/bin/python
# -*- coding: utf-8 -*-

from sys import stdin

def add_directory(directory_hash, path, done):
    directory = ''
    created_directories = 0
    for node in path.split('/')[1:]:
        directory += '/' + node
        if not directory_hash.has_key(directory):
            directory_hash[directory] = True
            created_directories += 1
    return 0 if done else created_directories

if __name__ == "__main__":
    ncases = int(stdin.readline())

    for ncase in xrange(ncases):
        case = stdin.readline()
        done_nodes, left_nodes = map(int, case.split(' '))

        directory_hash = dict()

        for done_node in xrange(done_nodes):
            add_directory(directory_hash, stdin.readline().replace('\n',''), True)

        created_directories = 0

        for left_node in xrange(left_nodes):
            created_directories += add_directory(directory_hash, stdin.readline().replace('\n',''), False)

        print 'Case #' + str(ncase + 1) + ': ' +  str(created_directories)
  • All Your Base: A little more complicated problem. The input was an alphanumeric string and the output a mapping each symbol of it to de minimum number, in any base, expresed as a decimal. The minimum base was 2 and the first symbol can’t be 0. The solution was quite simple, map the first symbol to 1, after that, map from left to right each diferent symbol to 0, then 2, then 3 and so on. Finally change the number to base 10.
#!/usr/bin/python
# -*- coding: utf-8 -*-

from sys import stdin

if __name__ == "__main__":
    ncases = int(stdin.readline())

    for ncase in xrange(ncases):
        case = stdin.readline()

        mapped_number = 0
        symbol_number = dict()
        symbol_number[case[0]] = 1

        for symbol in case[1:-1]:
            if not symbol_number.has_key(symbol):
                symbol_number[symbol] = mapped_number
                mapped_number = 2 if mapped_number ==  0 else mapped_number + 1

        if mapped_number == 0:
            symbol_number[case[0]] = 1
            case.replace(case[0], '1')
            mapped_number = 2

        to_pow = len(case) - 3
        result_decimal = 1 * pow(mapped_number, to_pow + 1)

        for symbol in case[1:-1]:
            result_decimal += (pow(mapped_number, to_pow) * symbol_number[symbol])
            to_pow -= 1

        print 'Case #' + str(ncase + 1) + ': ' +  str(result_decimal)

Both handled the large inputs and are quite efficient. The teaching here is that before doing any code is important to reason how to solve the problem.
Lets see how i perform in the Code Jam :3.