Vučić is served to those earning up to 300 euros a month

Whatever everyone else is up to, it turns out all I do on this blog is trash Vučić. Here is another post, a sort of attempt at data analysis, this time about who votes for our dear Vučko depending on their salary.

BTW, if you like these dumb analyses of mine on such random topics, you will surely also like the datatata blog run by my colleague:) Now that the ads are done:), let’s get down to business.

How I got to this analysis

 

In the last election, Vučko won by a landslide. Everyone can now argue about the reasons (I even tried my hand at it myself). One of the more common explanations was that people were forced to vote for him, or that it is some villagers out there who vote while the “civic option” is inert, and so on. I wanted to see whether there is really anything to it. One of the few open data sources in Serbia is the Statistical Office of the Republic of Serbia (RZS). Buried deep in there, so that nobody finds them by accident (knock on wood) or, God forbid, scrapes them (so they are often available only from the browser, nicht .csv, nicht Excel), are all kinds of data, some more interesting than others. A few of us brainstormed about what would be the best and most interesting thing to look at, but we found nothing smarter than salaries. If you have a good data source that would make sense to cross with election results, let me know!

Anyway, among the employment and earnings publications you can find ZP14, i.e. earnings by municipality and city; the latest report at the time of writing is from May. A bit of cleanup and we easily get a list of average salaries by municipality. Election results by municipality can also be downloaded from the RZS site. Again a bit of polishing, and the two sources are ready to be joined. I mean, let’s be honest, it was clear there would be a correlation, the only question was how strong:) The cleaned and joined data, as one small and easy-to-digest Excel file, can be downloaded from here.
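For anyone who would rather script the join than do it in Excel, here is a minimal sketch of the idea. The column names and the sample rows are made up for illustration (they are not the real RZS figures), and the real files would need their own cleanup first.

```python
import csv
import io

# Hypothetical sample rows; the real tables come from the RZS publications.
SALARIES_CSV = """municipality,avg_salary_rsd
Trgovište,29000
Surčin,75000
Čajetina,42000
"""

VOTES_CSV = """municipality,vucic_pct
Trgovište,85.0
Čajetina,35.0
Vračar,33.0
"""


def read_column(text, key, value):
    """Read a two-column CSV into a dict of {key: float(value)}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[key].strip(): float(row[value]) for row in reader}


def join_by_municipality(salaries, votes):
    """Inner join: keep only municipalities present in both sources."""
    common = sorted(set(salaries) & set(votes))
    return [(name, salaries[name], votes[name]) for name in common]


salaries = read_column(SALARIES_CSV, "municipality", "avg_salary_rsd")
votes = read_column(VOTES_CSV, "municipality", "vucic_pct")
joined = join_by_municipality(salaries, votes)

for name, salary, pct in joined:
    print(name, salary, pct)
```

Municipalities that appear in only one source (Surčin and Vračar in this toy sample) simply drop out, which is the part that needs manual checking on the real data, since names are not always spelled the same in both files.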

Analysis

 

OK, there definitely is a correlation, as this chart clearly shows:

 


 

The X-axis shows the average salary by place, and the Y-axis the percentage of votes for Vučić. Unfortunately, I could not label every place, but I did label most of the ones that “stick out” (outliers, if you will). If you want to see how your place ranks, go download the Excel file and find yourself! Here are a few things I noticed (let me know if you spot anything else interesting):

  • In a way, I am correlating apples and oranges. The average salary is recorded where companies are registered, while the vote share is recorded where people are registered, so the mapping is not exact. Still, I do not think I am far off: although people move around more between well-connected places (e.g. Novi Beograd↔Vračar), there is a spatial correlation (someone registered in Vlasotince will not work for a company registered in Subotica)
  • Surčin is an outlier because the air traffic control and the airport are there (and they are known for not exactly small salaries:). At least that is how I explain it
  • Among the other interesting outliers, I would point to Čajetina, a bastion of free Serbia:p (I do not find the Belgrade triangle “Vračar-Novi Beograd-Stari Grad” particularly interesting)
  • Trgovište won:) I know it is not politically correct, but every time I see this chart with it at the top, I chuckle a bit:)
  • I expected Mladenovac (not shown here) and Beli to turn out as an outlier, but they did not: Beli did not manage to pull Mladenovac out of the mediocre 3σ correlation band
  • And finally: cum hoc ergo propter hoc. Just because a relationship exists does not mean that a low salary is the cause; much more likely the two are mutually related. In other words, we can say that poorer people vote for Vučić (globally speaking), but we cannot say that if we pour millions of euros into Trgovište, those people will vote for someone else. Or the other way around, that Stari Grad will become poorer if it starts voting for Vučić. It is all nicely tied together, and these charts only surface a small degree of simplicity in the general chaos of reality;) (yes, I know, sometimes I talk a lot of crap)

 

And now the trend:

 

R squared is 0.27. I used a linear trend (it looked the closest and gave the best results on this sample, although this cannot really be a linear model:). From this trendline we can conclude that in a place where the salary is 0 dinars, the share of votes for Vučić would be around 85%, and it would drop below 50% at an average salary of 80,000. Vučić (and the likes of him) would get 0% of the votes once the average salary reaches 210,000 dinars. I guess it is now clear where the idea for the title came from:) I do not know about you, dear reader, but to me these numbers make total sense.
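The trendline arithmetic can be sketched in a few lines of plain Python: an ordinary least-squares fit, R squared, and the salary at which the fitted line crosses a given vote share. The sample points below are invented for illustration and are not the real data set. Note that the figures quoted above are rounded: a line that passes exactly through 85% at 0 dinars and 50% at 80,000 dinars would reach 0% at roughly 194,000 rather than 210,000.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def salary_where(pct, slope, intercept):
    """Salary at which the fitted line predicts the given vote share."""
    return (pct - intercept) / slope


# Hypothetical (salary in RSD, vote share in %) points, NOT the real data.
salaries = [30000, 40000, 50000, 60000, 80000]
shares = [75.0, 62.0, 66.0, 55.0, 48.0]

slope, intercept = fit_line(salaries, shares)
print("R squared:", r_squared(salaries, shares, slope, intercept))
print("50% crossing:", salary_where(50.0, slope, intercept))
print("0% crossing:", salary_where(0.0, slope, intercept))
```

Extrapolating to 0 dinars or to 0% of the votes goes far outside the range of the data, so those two numbers are a joke more than a forecast.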

By region

The correlation is also clear when we look at the results aggregated by region:

 

 

Just the Belgrade pashalik

If you are not from Belgrade, this will be yet another one of those “screw Belgrade” moments (I know how it is… I used to live outside that Belgrade:), and I agree there is no more point in analyzing Belgrade here than, say, Niš. But what can I do: Belgrade has more municipalities, it looks nicer on charts, and it is my city, so it interested me the most:

 

 

When you look at Belgrade, just remember the first bullet above about mixing apples and oranges. It is most pronounced in Belgrade, so take this chart with a grain of salt.

Posted in Analitika, Politika

Azure Functions, or how to buy a field via SMS

I had two diametrically opposite use cases where I wanted quick notifications (SMS if possible) about “offers” (ads). For example, I wanted to buy a field, but I figured out I was fine with waiting for one to come up for auction through a bank. The other use case was buying a mobile phone. In the first case, I did not want to check the bank’s site every day for something that shows up once every 3 months; in the second, I wanted very fast notifications when a very cheap phone shows up (those usually sell within 2-3 hours of being posted).
I decided to solve both problems the same way, using as much of the Azure stack as possible. No, maybe we did not understand each other: I really overarchitected as much as possible, using the newest and most hipster technologies I could.

<cynicism>

let’s be honest, all these technologies are as old as computing itself, only the implementations differ…

</cynicism>

This was not in vain, for two reasons: first, I learned a lot of new things, and second, a solution like this would make sense for larger systems, where scaling can really be a problem (but no, definitely not for buying a field).
The only constraint I had was that I wanted to try out Azure Functions, the serverless equivalent of AWS Lambda. However, there were a few problems along the way, so let’s go through them one by one.

What Azure Functions are

(skip this section if you are familiar with Azure Functions, this is just an intro)

Azure Functions are Microsoft’s answer to AWS Lambda: a way to implement a stateless architecture which, depending on your business logic, can save quite a bit of money (you pay only for function executions, not for a whole VM idling the rest of the time). Sending SMS messages and scraping sites is a really good example. If you want to know more about Azure Functions, start here.

<cynicism>

I have to butt in again. There is nothing inherently stateless about a stateless architecture, it is just that you do not have to worry about it. I like how it is put here: Serverless = “someone else is responsible for these servers going down”

</cynicism>

The short explanation is that you write just one function (in the language of your choice, quite a few are supported) and define its:

  • trigger (when the function starts: e.g. a timer, or a message arriving on a message queue…)
  • input (what goes into the function: e.g. a blob, a DocumentDB document…)
  • output (what comes out of the function; it can be anything an input can, but also other things such as an HTTP request, sending an email, or what we will use here: sending an SMS via the Twilio platform)

Inputs and outputs are not limited to one each; there can be several (e.g. in our case the outputs are both Table Storage and Twilio SMS, as you will see later), while a function has a single trigger. All of them are defined as arguments of the function you write.

There were a few more gremlins I ran into, but I resolved them fairly quickly with the help of the documentation and SO, such as how to add new NuGet packages and how to share common code between functions. And since Azure Functions run on Web Apps underneath, all the remaining questions answered themselves (how to set an environment variable, how to upload a file via FTP, and even Kudu works at https://<function_app_name>.scm.azurewebsites.net).

Overall, the solution looked pretty obvious: scrape the site inside Azure Functions, save the result somewhere, and send an SMS. But the devil never sleeps…

Where to keep system state in a stateless architecture

The first problem was where to keep the state of the system, i.e. where to store the fields and phones that were already processed. Of the available options, Azure Functions offered DocumentDB and Table Storage, but not, for example, SQL. I wanted something cheap and lightweight, so I went with Azure Table Storage (DocumentDB was too expensive for this purpose, although I think it would be the better choice if this had to scale).

One interesting thing to watch out for, and one that can bite you, is that Azure Functions may run your function multiple times concurrently. Keep that in mind if you need atomicity. A simple example: two instances start in parallel and scrape the same offers, both check whether the offers exist in Table Storage, and since they do not, both send the SMS. Watch out!

How to merge offers

OK, now that we know where the state lives, how do we tell an Azure Function that we do not want to insert a new row if one already exists? It turns out there is no such thing (which makes sense, since Azure Functions work only with concrete input and output components, not with a hypothetical input that satisfies some query). So I had to roll up my sleeves and write a query that checks whether such an offer already exists in Azure Table Storage. To be clear, this is not hard, it just was not in the spirit of Azure Functions. As for the primary key, Table Storage builds it from two separate parts: PartitionKey and RowKey (I assume the meanings are clear from the names). It seemed logical to use the site I pull the offer from, i.e. the offer type (fields, phones, planes, trucks…), as the PartitionKey, and to take the RowKey from the site so that it is specific to the given offer.

There is one more important thing worth mentioning. Azure Functions work on the principle that, if you declare an output (Table Storage, in our case), it is not optional: Azure expects you to pass a new row, period. In our case, we want a row only if the offer does not already exist in the table. Fortunately, Azure Functions support this: the output of the function is then not an object T that gets written to the table, but an ICollector<T>. A collection like this allows the final output to be 0 rows, but also more than one row!
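To make the “0 rows, 1 row, or many rows” idea concrete, here is a small Python simulation of the logic. It is only a sketch: `InMemoryTable` and `collect_new_rows` are made-up stand-ins, not the Azure SDK and not the real function, and the plain list plays the role that ICollector<T> plays in the actual binding.

```python
class InMemoryTable:
    """Stand-in for a Table Storage table, keyed by (PartitionKey, RowKey)."""

    def __init__(self):
        self.rows = {}

    def exists(self, partition_key, row_key):
        return (partition_key, row_key) in self.rows

    def insert(self, entity):
        self.rows[(entity["PartitionKey"], entity["RowKey"])] = entity


def collect_new_rows(table, offers):
    """Return the rows to write: possibly none, possibly several."""
    collector = []  # plays the role of the collector output
    seen = set()    # guards against duplicates within one batch
    for offer in offers:
        key = (offer["PartitionKey"], offer["RowKey"])
        if table.exists(*key) or key in seen:
            continue
        seen.add(key)
        collector.append(offer)
    return collector


table = InMemoryTable()
table.insert({"PartitionKey": "fields", "RowKey": "f-1", "Title": "Old field"})

scraped = [
    {"PartitionKey": "fields", "RowKey": "f-1", "Title": "Old field"},
    {"PartitionKey": "phones", "RowKey": "p-7", "Title": "Cheap phone"},
    {"PartitionKey": "phones", "RowKey": "p-7", "Title": "Cheap phone"},
]

new_rows = collect_new_rows(table, scraped)
for row in new_rows:
    table.insert(row)
print(len(new_rows), "new row(s) written")
```

This check-then-insert pattern is exactly the one that is not atomic when two instances run at once, so it does not by itself protect against the double SMS described earlier.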

Decoupling different offer types

The basic idea is that different types of offers (fields, phones) can flow through the system. It would be a bit silly to have all the scraping of all the sites in a single Azure Function. Also, the scraping period for mobile phones (e.g. 15 minutes) is not the same as for fields (once a day at most). Of course, whenever this kind of decoupling is needed, the answer is always our good friend the message queue. Azure Functions support that scenario (a queue can be a trigger), and Microsoft’s solution is called Azure Service Bus. Naturally, the offers we push to the queue will be in JSON (because XML is so 1990s). Now the solution is obvious from a mile away: create N different Azure Functions, each scraping one type of offer site, with a timer as the trigger and Azure Service Bus as the output. The JSON they pump into Service Bus can be any dictionary, but it must contain the keys “partition” and “id”. On the other side of the queue is a function triggered by Service Bus. Its job is to read an offer from Service Bus and check whether it already exists in Table Storage. It has two outputs: one is again Table Storage (optional, used only when a new offer shows up and needs to be written), and the other, also optional, is the Twilio SMS service.
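The flow above can be tried out locally with nothing but the standard library. In this sketch `queue.Queue` stands in for Azure Service Bus, a dict stands in for Table Storage, and a list stands in for Twilio; the function names and the sample offers are made up for illustration.

```python
import json
import queue


def publish_offers(bus, partition, offers):
    """Scraper side: push each offer as JSON with "partition" and "id" keys."""
    for offer in offers:
        message = dict(offer, partition=partition)
        if "id" not in message:
            raise ValueError("every offer needs an 'id' key")
        bus.put(json.dumps(message))


def consume_offers(bus, table, send_sms):
    """Consumer side: store and announce only offers not seen before."""
    while True:
        try:
            raw = bus.get_nowait()
        except queue.Empty:
            break
        offer = json.loads(raw)
        key = (offer["partition"], offer["id"])
        if key in table:
            continue  # already processed: no new row and no SMS
        table[key] = offer
        send_sms("New offer in {}: {}".format(offer["partition"], offer["id"]))


bus = queue.Queue()  # stand-in for Azure Service Bus
table = {}           # stand-in for Table Storage
sent = []            # stand-in for Twilio: collect texts instead of sending

publish_offers(bus, "phones", [{"id": "p-1", "title": "Cheap phone"}])
publish_offers(bus, "fields", [{"id": "f-1", "title": "Field at auction"}])
publish_offers(bus, "phones", [{"id": "p-1", "title": "Cheap phone"}])  # re-scraped
consume_offers(bus, table, sent.append)

print(sent)
```

The nice property is that adding a new offer type only means adding one more publisher; the consumer does not change as long as the message carries “partition” and “id”.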

The final solution

After all these problems, here is a diagram of this overarchitected Frankenstein of mine:

Yes, it is ugly, but it is supposed to be ugly. By the way, this whole thing costs about $1 a month (plus $1 for the Twilio SMS service), which is much less than renting a VM and putting, say, some Python plus Mongo on it. I am no expert, and I have not tried AWS Lambda, so I cannot give a better overview or a better point of reference, but here is what I think should be improved:

  • Support for SQL (some of us old-timers would still use SQL) and other NoSQL solutions
  • Support for local testing (not possible today)
  • The Twilio SMS service did not work, so I had to add it “by hand” (the OOB one had problems): a bad first experience for Microsoft and Azure.

If you are interested in any detail of what I described, the whole source code is available on GitHub. If you are not interested in any detail and everything is clear, you can always go to this link to see my phone number, call me and thank me for the crystal-clear description of Azure Functions:)

And finally, here is how it looks in the portal:

Posted in Programiranje

The capillary fingers of inevitability

This is yet another text in the flood of texts you will run into these days that try to analyze where “we”, the citizens, went wrong. Another one in the series you will read nodding in approval while scrolling through the shares of your little bubble of people, the bubble you built on Facebook and Twitter, while eating your organic salads and vegetarian treats. And you wonder how toothless Sale Nađ from Mali Iđoš managed to outvote us again, this time so effectively. Well, simple: because Sale is right! And here is why.

The first mistake in our thinking is that we are above Sale and that our choice is more educated, smarter, prettier… it is not. And we do this often; the unconscious bias I see in my own bubble, above all, is incredible (and I am doing it here too, caricaturing on purpose). That brings us to the second thing we are completely blind to: our needs on Maslow’s hierarchy of needs are far better met than those of poor Sale Nađ. While we think about buying chia seeds and the violated rule of law in Hercegovačka Street, Sale does not know what he will eat tomorrow, literally! That is the relationship between long-term and short-term thinking. Maybe Saša Janković really is a good long-term choice for Serbia, but 55% of people obviously do not care about that long-term choice, only about those fingers of inevitability, the inevitability of tomorrow. And seen that way, Saša Janković (and not even Beli) offered nothing to that tormented Sale. Sale Nađ is thrashing and drowning in an ocean of inevitability, water is getting into his nose and ears, he is desperately gasping for a bit of oxygen, and the only thing he hears through the roar of water in his ears is “For the rule of law. For civil liberties. So that you are not afraid. To bring the smile back to your face”. Well, screw that, Sale does not need it! It is as if you stood next to a drowning man and, instead of giving him a hand, told him that all participants in sea swimming must complete a mandatory first aid course. What Sale needs is a nice deal between Vučić D.O.O. and himself, in which, for a vote given to that growing mass, he secures a few more days of relief from his short-term inevitability, and maybe more. Where he in turn gets that capillary, that vote, and then passes that capillary vote on, and we know what capillaries do: they bring much-needed oxygen to dying tissue. And Sale the drowning man needs oxygen. And let’s be honest, SNS is no longer a party, not even a coalition of parties; it is a well-oiled company, a system, a well-oiled growing mass that in fact depends on that dying tissue.

So, all you smart citizens, the cream of this society, who are bitter at stupid Sale while sipping your cappuccino on Obilićev venac and thinking about how to trim the mortgage and the number of kilograms a bit: do not be angry at Sale Nađ. He is not to blame; he did what he had to, and he did what was smartest for him. Do not look down on him from up there on Maslow’s pyramid of needs just because he is lower on it. To paraphrase that children’s song: “It is easy for an intellectual to act brave; while he demands civil rights, he feels no pain”. Next week you will protest by walking through the city with your kids, then pop into McDonald’s afterwards, and he will maybe see the yellow duck only at the livestock market in Bačka Topola while selling cut tobacco.

And do not start with me about Sale being to blame himself. He is not. Sale is simply below the 50th percentile of living standard, and if it were not him, it would be someone else. We can discuss whether the system is set up so that Sale has no choice, or why our median living standard is so low, or why there was no candidate who would promise maybe a little stealing in exchange for a little improvement of the system, but one thing is certain: we will not change the system without Sale Nađ (and I do not see the aloof Saša Radulović and Saša Janković having that capacity and that charisma). I apologize for not going into proposed solutions here, but that is a topic for a separate text.

In the end, the only thing that scares me is that the capillaries might spread too far and that growing mass might get too much oxygen. And then we have metastases. And that Serbia then ends up like Sale Nađ from the song of the same name.

Posted in Politika

What football odds are best to play in big competitions

Min or max, that is the question

 

For the past 8 years or so I have been running a small betting pool website for me and my friends. Nothing fancy: around 50 people at most, and we bet only on the final result (1, X, 2). The number of points you get equals the decimal odds of your winning bet (so if you back odds of 1.61, or 8/13 in fractional terms, and win, you get 1.61 points). It is the same as playing singles with a stake of 1 on each game. Whoever has the most points wins! I fire up the site every 2 years, either for the World Cup or for the Euro. I am writing this right after the Euro 2016 final and we have our new winner:) (and as you probably guessed, he won mostly because he placed a lot of bets on Iceland, which brought him a lot of points:). The question I have heard a lot since this pool started, and especially during competitions, is:

 

If I play only the minimum (or maximum) odds throughout the whole competition, will I finish first?

 

The short answer, for those who do not want to read too much: unfortunately, no. Playing low odds will definitely minimize your risk, but it will not bring you a profit. Playing only the maximum odds will not get you far either, depending of course on how weird the competition was. In fact, not only will neither approach get you to first place in my pool, it will not even reliably keep you in the black if you played for real money! Let’s look at Euro 2016. If you played only the minimum odds, you would get 36.94 points (the sum of all winning odds). To put this into perspective, if you staked 1 on every match, and there were 51 matches in total, you would get back only 36.94, so you would lose more than 14. Definitely not the way to go. Now let’s see what would happen if you played only the maximum odds: you would be far better off, getting back 56.35 (points). So you would end up with a positive profit, i.e. earn money! I did the same analysis for World Cup 2014, and all the results are in the table below:

 

| Competition | Games | Odds played | Points | Profit | Hypothetical place in my pool |
|---|---|---|---|---|---|
| World Cup 2014 | 64 | Min | 67.05 | +3.05€ | 16/68 |
| World Cup 2014 | 64 | Max | 38.35 | -25.65€ | 64/68 |
| Euro 2016 | 51 | Min | 36.94 | -14.06€ | 56/70 |
| Euro 2016 | 51 | Max | 56.35 | +5.35€ | 16/70 |

 

As you can see, at the World Cup it was far better to play the minimum odds, while at Euro 2016 it was better to play the high ones. I guess the explanation is that Euro 2016 was far more volatile than World Cup 2014. There were so many surprises: Iceland, Wales, Hungary, even Albania in the first round. Looking at World Cup 2014, I cannot think of any outsider beating the odds except Costa Rica, if you remember. All in all, I would say that:

  • Playing only the minimum or only the maximum odds will not reliably make your profit positive
  • If you somehow know in advance whether a competition is going to be volatile or not, you could minimize your risk
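The scoring behind the table can be sketched like this. The three matches are invented for illustration (they are not the real Euro 2016 or World Cup 2014 odds), and `strategy_points` is a hypothetical helper name: for every match it bets on the outcome with the lowest (or highest) odds and adds those odds to the total only if that outcome actually happened. Profit is then points minus the number of matches, since each bet costs a stake of 1.

```python
def strategy_points(matches, pick):
    """Sum the odds of winning bets when always betting on pick(odds)."""
    total = 0.0
    for odds, result in matches:
        chosen = pick(odds, key=odds.get)  # outcome with the min or max odds
        if chosen == result:
            total += odds[chosen]
    return total


# Hypothetical matches: (odds for 1/X/2, actual result).
matches = [
    ({"1": 1.5, "X": 4.0, "2": 7.0}, "1"),
    ({"1": 2.8, "X": 3.1, "2": 2.6}, "X"),
    ({"1": 6.0, "X": 4.2, "2": 1.55}, "1"),
]

min_points = strategy_points(matches, min)
max_points = strategy_points(matches, max)

print("min odds:", min_points, "profit:", min_points - len(matches))
print("max odds:", max_points, "profit:", max_points - len(matches))
```

The same subtraction reproduces the Profit column above, e.g. 36.94 points over 51 Euro 2016 matches gives -14.06.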

 

Then what, if not min or max?

 

So, we are constraining ourselves if we only ever pursue the minimum or maximum odds. What about X (the draw), which is usually somewhere in between? How can we create an algorithm that tells us whether to play X (rather than min/max on the favorite or the outsider)? Or does the value of X even matter? Maybe we should pursue only odds around certain values?

This train of thought is exactly how I ended up with the following analysis. Imagine we want to bet only on odds of Y. Given the 3 odds for 1, X and 2, usually none of them will equal Y, but some will be closer to Y than others. So I computed the hypothetical points achieved for every possible target value Y. Here is a simple example: let Y be 4.5. If the odds are 1.1, 4.2 and 5.0, you would bet on 4.2, as it is closest to 4.5. Note that this approach generalizes the minimum/maximum question everyone keeps asking (setting Y low (1.0) or really high (10.0) is the same as betting on the minimum or maximum odds, respectively). So, let’s plot how our points change with the odds we aim for:

points_taken_wc2014

points_taken_euro2016

The X-axis shows the odds we want to bet as close to as possible. The Y-axis shows the hypothetical number of points you would get betting on those odds. The orange line equals the number of matches and is the boundary between ending up positive and negative. I also added a couple of callout values so the hills and dips are easier to see. Obviously, the data sample is very small and there is no statistical significance here. The only thing left (other than adding odds from Euro 2012 etc., which I do not plan to do) is to sum the two competitions and treat them as one bigger competition. For example, there are small hills around odds of 3.x and 4.x in both competitions, and my guess was that they would be more pronounced in the sum:

points_taken

Well, this does make more sense to me overall:

  • You are rarely positive (above the orange line), and I would argue this is just statistical noise
  • Minimum and maximum level out to about the same
  • There are some peaks and dips around 3 and 4
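The “bet on whatever is closest to Y” rule that produced these charts can be sketched as follows. The matches are again made up, and the helper names are hypothetical; only the 1.1 / 4.2 / 5.0 example with Y = 4.5 comes from the text above.

```python
def closest_outcome(odds, target):
    """Pick the outcome whose odds are nearest to the target value Y."""
    return min(odds, key=lambda outcome: abs(odds[outcome] - target))


def points_for_target(matches, target):
    """Points collected when always betting on the odds closest to target."""
    total = 0.0
    for odds, result in matches:
        chosen = closest_outcome(odds, target)
        if chosen == result:
            total += odds[chosen]
    return total


def sweep(matches, targets):
    """One (target, points) pair per target: the curve shown in the charts."""
    return [(target, points_for_target(matches, target)) for target in targets]


# Hypothetical matches: (odds for 1/X/2, actual result).
matches = [
    ({"1": 1.5, "X": 4.0, "2": 7.0}, "1"),
    ({"1": 2.8, "X": 3.1, "2": 2.6}, "X"),
    ({"1": 6.0, "X": 4.2, "2": 1.55}, "1"),
]

targets = [round(1.0 + 0.5 * step, 1) for step in range(19)]  # 1.0 .. 10.0
curve = sweep(matches, targets)

for target, points in curve:
    print(target, points)
```

With a target below every available price the rule degenerates into “always minimum”, and with a target above every price into “always maximum”, which is why both ends of the curve match the min/max results.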

Conclusion

 

C’mon, you didn’t really believe there is something hidden in the data that will help you win big?:) With more data (more competitions), my take is that this line will flatten out. This makes sense, as odds are nothing but probabilities, and be it 1.01 or 10.0, if you play the exact same odds long enough, you will end up (almost) at zero. However, if you want to minimize your risk, maybe, just maybe, try playing odds around 3.2 and 4.0 (but avoid odds of 3.6 as much as you can:)

Posted in Analitika, Uncategorized

Statistics of B92 news and comments in 2015

The author is not affiliated with B92 in any way. The text presents only facts obtained through statistical analysis. Neither the original data nor the results derived from them have been modified.

Basic data

2015 is over. A bit late, below is a basic analysis of the B92 site and the news published on it, as well as the comments and categories. The analysis covers all news and comments published on B92 from 01.01.2015 to 31.12.2015. The complete source code of the scraper, as well as the whole database, can be found here.

In 2015, B92 published a lot of news and received a lot of comments; here is a short overview:

Total news items: 100,971
On average, one news item every: 5 minutes
Total categories: 173
Comments: 1,643,338
On average, one comment every: 20 seconds
Average comments per news item: 16.2
Total pluses: 175,415,492
On average, pluses per second: 5.5
Total minuses: 70,796,238
On average, minuses per second: 2.2
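The averages follow directly from the totals. Here is a short sketch of the arithmetic, assuming a 365-day year; the variable names are mine, and the text above shows the same values more coarsely rounded.

```python
SECONDS_IN_2015 = 365 * 24 * 60 * 60  # 2015 was not a leap year

news_items = 100_971
comments = 1_643_338
pluses = 175_415_492
minuses = 70_796_238

minutes_per_news_item = SECONDS_IN_2015 / 60 / news_items
seconds_per_comment = SECONDS_IN_2015 / comments
comments_per_news_item = comments / news_items
pluses_per_second = pluses / SECONDS_IN_2015
minuses_per_second = minuses / SECONDS_IN_2015

print("minutes between news items:", round(minutes_per_news_item, 2))
print("seconds between comments:", round(seconds_per_comment, 2))
print("comments per news item:", round(comments_per_news_item, 2))
print("pluses per second:", round(pluses_per_second, 2))
print("minuses per second:", round(minuses_per_second, 2))
```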

 

The magic number of 100,000 news items a year has been passed. Congratulations to B92:) Looking at the number of comments in this list (and given that comments are moderated), these statistics also say that the moderators had their hands full (just imagine how many comments did not pass moderation).

News

Here are the types of news B92 published, by category:

 

vesti_po_kategoriji

 

Looking at how B92 published news on a daily basis, an upward trend is noticeable.
vesti_dnevno

 

At the beginning of the year they published 263 news items a day, and at the end as many as 288. If this trend continues, by 2021 they will be publishing 150,000 news items a year, or over 400 a day (I do not know what kind of news that will be, but I hope the number in the “politics” and “crime” categories will not grow). The previous chart also clearly shows that the number of news items fluctuates from day to day. And indeed it does: the number of “peaks” on the chart is 52, i.e. the number of weeks.

Here is the distribution of news by day of the week:

 

vesti_nedeljno

 

I guess B92 rests on weekends too.

Much more interesting is the distribution of news by hour, i.e. when news is published most often:

 

vesti_sat

 

The chart shows that news peaks around 11:00, with two more peaks, one at 17:00 and one around 21:00. Presumably this is not accidental: an analysis was probably done, and that is probably when people read news most often.

However, knowing from a moment ago that news is not evenly distributed across days, let’s look at the hourly distribution again, this time broken down by day:

 

vesti_sat_dan

And indeed, there is a difference. The weekend peak is clearly different: 16:00 is prime time (while on weekdays it is shifted slightly to 17:00). Likewise, Saturdays show a slight bump around 20:00 (on weekdays it is around 21:00), and Sundays show a slight bump around 22:00 that is absent on all other days.

And what were the most frequent topics in the published news? An analysis was done of the words that appeared in news headlines. If we drop prepositions and conjunctions (“i”, “ili”, “na”, “u”, “ako” and the like) and do not distinguish between the various inflected forms of a word (“Vučić”, “Vučića”, “Vučićeva” …), here is the list of the most frequent words:

 

naslov_reci

 

The position of “SAD” (the USA) is quite fascinating. “Godina” (year) and “dan” (day) are fairly standard terms; dropping them was considered, but in the end they stayed. “Vučić” is still the undisputed ruler of the media space, and the year was also marked by a growing number of “izbeglica” (refugees). It is also interesting that “Zvezda” is ahead of “Partizan” overall, and we will see the details later. “Novak” managed to sneak onto the list in last place (occurrences of e.g. “Đoković” were not counted here), although my personal impression is that he held first place the whole year:)
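The counting itself can be sketched roughly like this. The stop-word list is the handful of words quoted above, and the “keep the first five letters” normalization is only a crude stand-in I made up to merge forms like “Vučić”, “Vučića” and “Vučićeva”; how the inflected forms were actually merged in the analysis is not described here, and the sample headlines are invented.

```python
import re
from collections import Counter

# Only the examples quoted in the text; a real list would be much longer.
STOP_WORDS = {"i", "ili", "na", "u", "ako"}


def normalize(word, stem_length=5):
    """Crude stand-in for stemming: lowercase and keep the first letters."""
    return word.lower()[:stem_length]


def top_words(titles, n=10):
    """Count normalized words across headlines, skipping stop words."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"\w+", title):
            if word.lower() in STOP_WORDS:
                continue
            counts[normalize(word)] += 1
    return counts.most_common(n)


# Invented sample headlines.
titles = [
    "Vučić u SAD",
    "Vučića čeka godina",
    "Vučićeva poseta i SAD",
]

print(top_words(titles, n=2))
```

A fixed-length prefix will happily merge unrelated words that share their first letters, so for anything serious a proper stemmer or lemmatizer for Serbian would be the better choice.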

 

If we dig a bit deeper into this list by breaking it down by category, we get a bit more detail:

 

naslov_reci_kategorija

 

I leave it to the reader to draw conclusions for each of the categories, and there are plenty to draw.

Comments

Besides the news, readers’ comments were analyzed too. As already stated above, we are talking about an incredible figure of over a million and a half comments, or 16.2 comments per news item on average. Saying that without showing the distribution of that number would not be fair, so here it is:

 

distribucija_broja_komentara

 

This chart shows how many news items have how many comments (about 24,000 news items have 0 comments, a little over 10,000 have one comment, etc.). And here are the categories that draw the most comments on average (only categories with more than 50 news items were taken into account):

| Category | Total news items | Total comments | Average comments per news item |
|---|---|---|---|
| Eurobasket | 378 | 19041 | 50.37 |
| US Open 2015 | 293 | 13187 | 45.00 |
| Wimbldon 2015 – Ozmo na travi | 47 | 2052 | 43.65 |
| Košarka | 3704 | 159725 | 43.12 |
| Roland Garros 2015 | 303 | 12422 | 40.99 |
| Seks | 44 | 1680 | 38.18 |
| Politika | 5305 | 198124 | 37.34 |
| Drugi pišu | 80 | 2822 | 35.27 |
| Tenis | 2462 | 83489 | 33.91 |
| Wimbldon 2015 – Vesti | 332 | 10182 | 30.66 |
| NBA | 1075 | 30372 | 28.25 |
| Australian Open 2015 | 372 | 10011 | 26.91 |
| Život – Vesti | 5157 | 133710 | 25.92 |
| Pregled štampe | 111 | 2843 | 25.61 |
| Nauka | 140 | 3345 | 23.89 |

 

Basically, Serbs obviously like commenting on sex the most, and only once they get tired of commenting on tennis; I guess we believe these are the areas where we are most qualified to leave a comment. At the bottom of this list (not shown here) is, by a wide margin, Bulevar, which across more than 2,400 published news items managed to get a total of… 10 comments.
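The per-category average is just comments divided by news items, with a cutoff so that tiny categories do not dominate. A minimal sketch, using four rows from the table above; the function name and the `min_news` parameter are mine. Note that the table includes two categories with fewer than 50 items (47 and 44), so the cutoff actually used may have been a bit lower than stated.

```python
def average_comments(rows, min_news=50):
    """Rank categories by comments per news item, skipping small ones."""
    ranking = []
    for name, news_count, comment_count in rows:
        if news_count <= min_news:
            continue  # too few news items for a meaningful average
        ranking.append((name, round(comment_count / news_count, 2)))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# (category, total news items, total comments), taken from the table above.
rows = [
    ("Politika", 5305, 198124),
    ("Seks", 44, 1680),
    ("Eurobasket", 378, 19041),
    ("Košarka", 3704, 159725),
]

for name, average in average_comments(rows):
    print(name, average)
```

The averages in the table appear to be truncated rather than rounded, so the last digit can differ by one from what this prints.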

 

If we look at when we leave comments, we see a distribution similar to that of the news:

 

komentari_dan_nedelje

 

This tells us nothing. If we compare this chart with the earlier one, we can see the relative ratio between “how much news arrives on a given day” and “how much that news gets commented on”, and we get:

 

komentari_dan_nedelje_relative

 

People “do not manage” to comment on all the news on weekdays, but they make up for it on weekends, especially on Sundays, when it looks like there is not enough news. A similar breakdown by hour, showing when people comment most often, gives a similar curve:

 

komentari_sat

 

Comments more or less follow the publishing of news. To check this, a histogram was made showing how long after a news item is published (in minutes) the comments arrive:

 

distribucija_komentara

 

OK, this shows the distribution over one day (1440 minutes), so the maximum is hard to see. Zooming in a bit, we get:

 

distribucija_komentara_zoom

 

It turns out that most comments on a news item arrive 30 minutes after it is published. Is that the average time needed to read the news and write a comment? Judging by the comments you can read on B92 every day, hardly; it seems that sometimes people do not even read the headline to the end before commenting. And who are these commenters anyway? If we look at the names of the top 10 authors with the most comments, we get a pretty boring list:

 

autori

 

Apart from telling us that authors are mostly men (in fact, the first female name shows up only around 20th place), we cannot link them to specific people. So all the “usual” names were dropped (with a rule-of-thumb definition of “usual”), and the new list of the top 15 authors looks like this:

 

autori2

 

Congratulations to “smuleco”, whoever that is: they dominated with 5,257 comments written in 2015. Bots, stop changing your names and you too might make this list. And here are the authors with the best comments, i.e. those with the most pluses:

| Author | Number of comments | Average pluses |
|---|---|---|
| marko (dorcol) | 53 | 331 |
| sasacg | 84 | 292 |
| nemanjabb | 220 | 264 |
| lion | 128 | 251 |
| markiz | 83 | 242 |
| theriddler | 54 | 241 |
| dexr | 72 | 240 |
| gajetano | 190 | 239 |
| expx | 52 | 238 |
| paspalj | 51 | 234 |

 

Only authors with more than 50 comments were taken into account. And here is the same thing for the most hated authors:

| Author | Number of comments | Average minuses |
|---|---|---|
| herr wolf | 52 | -253 |
| menader | 78 | -220 |
| ruža | 66 | -219 |
| tamni vilajet | 82 | -214 |
| baba | 52 | -198 |
| vanja petrovic | 53 | -185 |
| antiparazit | 53 | -180 |
| fedex1 | 58 | -178 |
| zimzeleni | 156 | -174 |
| dexr | 72 | -171 |

 

And which comments struck readers the most, so that they gave them a plus? Here is the list of the top 10 comments:

| News item | Author | Comment | Pluses |
|---|---|---|---|
| Stefanovic: Vucic passed the polygraph, Branko is Saša | aco | haha, we have had all sorts of things in the past 25 years, but this is unrepeatable | 6628 |
| Stefanovic: Vucic passed the polygraph, Branko is Saša | Kol | With top marks, I presume! :-D | 5432 |
| Vucic: I will not give up Gašic and Loncar | Bane | What a demagogue… | 5359 |
| Stefanovic: Vucic passed the polygraph, Branko is Saša | Mxyed | And why was there no live broadcast of the examination? :) | 5171 |
| Vucic: I will not give up Gašic and Loncar | …. | if you will not give them up, then you resign! | 5140 |
| Minister Gašic’s vulgarity / VIDEO | Persa | That is him! That is them! | 4961 |
| Vucic: I will not give up Gašic and Loncar | strahinja | What did we do to offend God? | 4494 |
| Vucic: I will not give up Gašic and Loncar | grbovic | The problem is not that the army wanted to save a child; the problem is that the order was issued by unqualified personnel. Enough with the demagoguery. Do you think the people are crazy and do not know that 6 people died for the sake of political points? And you, Vucic, do not have to give them up. The people will tell you everything at the next election. | 4462 |
| Minister Gašic’s vulgarity / VIDEO | Miki | I do not understand what the problem is?! What to expect from such a person, who came into politics from the tavern and the construction site in a dishonest way. He did not answer for those killed in the helicopter, so why would he have a problem saying anything to anyone. Tomorrow he will show up and convince his voters that it was all edited, staged and taken out of context. | 4300 |
| Tony Blair in Serbia, ministers silent | Veteran | I defended my country in 1999 from the NATO aggressor, one of whose leaders was Tony Blair. I feel ashamed today. | 4282 |

 

And the same kind of list for the most hated comments (those with the most minuses):

 

 

Hall of Fame

 

Finally, an attempt was made to extract a list of the “most positive” and “most negative” news items. Many different approaches were tried, but none of them ever produced a meaningful list. Whether you take the news items with the most pluses on their comments, the ones with the highest average number of pluses, or the ones with the highest ratio of pluses to minuses, the bottom line is that there is no good metric for this. Still, the search turned up some news items that stand out from the rest by various criteria, so those are shown here. They also make a good retrospective of the year. That's all, enjoy!
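To make the candidate metrics mentioned above concrete, here is a minimal sketch of how they could be computed for one news item. It assumes the comments of a news item are available as (pluses, minuses) pairs; the function name, the dictionary keys and the sample numbers are made up for illustration.

```python
def news_metrics(comments):
    """Compute the candidate metrics for a single news item.

    `comments` is a list of (pluses, minuses) pairs, one per comment.
    """
    count = len(comments)
    pluses = sum(p for p, _ in comments)
    minuses = sum(m for _, m in comments)
    difference = pluses - minuses
    return {
        "comments": count,
        "total_pluses": pluses,
        "total_minuses": minuses,
        "avg_pluses": pluses / count if count else 0.0,
        "avg_minuses": minuses / count if count else 0.0,
        "difference": difference,
        "avg_difference": difference / count if count else 0.0,
        # A news item with no minuses at all has an unbounded ratio.
        "ratio": pluses / minuses if minuses else float("inf"),
    }

# Made-up numbers for a news item with two comments.
metrics = news_metrics([(100, 10), (50, 40)])
```

Each metric favors a different kind of news item (heavily commented, strongly liked, or one-sided), which is one way to see why no single list came out as meaningful.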

 

News items with more than 1000 comments

News items with more than 150,000 pluses on their comments

News items with more than 120,000 minuses

News items with more than 600 pluses per comment on average

News items with more than 650 minuses per comment on average

News items where pluses exceed minuses by more than 110,000

News items where minuses exceed pluses by more than 25,000

News items where pluses exceed minuses by more than 500 per comment on average

News items where minuses exceed pluses by more than 300 per comment on average

Posted in Analitika, Politika | Tagged , | 10 Comments

(Probably) my most complex line of code ever written

The line of code I am going to present here is probably the most complex one I have ever written. The goal was to import StackOverflow’s questions and answers into MongoDB for further analysis. You can find the whole StackOverflow dump in XML format here. Once you unpack it, it takes 8 lines of code to load it into MongoDB:

from pymongo.mongo_client import MongoClient
import xml.etree.ElementTree as etree
if __name__ == '__main__':
    db = MongoClient('localhost', 27017)['so']
    for event, elem in etree.iterparse('/home/kokan/Posts.xml', events=('end',)):
        if elem.tag != 'row': continue
        db.entries.insert_one(elem.attrib)  # Collection.insert() was removed in PyMongo 4
        elem.clear()

And this is literally the whole program!

However, you might notice that all fields end up as strings in MongoDB. Somebody might not care and just live with this, but I have OCD and just couldn’t let that happen. So I started looking at all the attributes in the XML and figuring out their types. It turns out we have strings, integers, dates and even one list (the attribute “Tags”, which is in the format “<html><css><css3><internet-explorer-7>”). My first reaction was to add code like this:

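# assumes: import dateutil.parser
for key, value in elem.attrib.items():
    if key == 'Id':
        elem.attrib[key] = int(value)
    elif key == 'CreationDate':
        # the trailing 'Z' makes dateutil treat the timestamp as UTC
        elem.attrib[key] = dateutil.parser.parse(value + 'Z')
    elif key == 'Body':
        pass  # this is already a string
    # ... (one more branch per remaining key)
    else:
        print('Unknown key %s with value %s' % (key, value))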

You can see where this is going… So I wanted a way to apply preprocessor logic to any given key and cast its value from a string to its real type. Another requirement was not to miss any key, i.e. I should have a list of all known keys, so that if a new key pops up, I can examine it and determine its type before rerunning the script. Here is my end result, a typed import in under 30 lines of code:

import xml.etree.ElementTree as etree
import dateutil.parser
from pymongo.mongo_client import MongoClient

INTEGER_KEYS = ('Id', 'ParentId', 'LastEditorUserId', 'OwnerUserId', 'PostTypeId', 'ViewCount', 'Score', 'AcceptedAnswerId', 'AnswerCount', 'CommentCount', 'FavoriteCount')
STRING_KEYS = ('Title', 'LastEditorDisplayName', 'Body', 'OwnerDisplayName')
DATE_KEYS = ('CommunityOwnedDate', 'LastActivityDate', 'LastEditDate', 'CreationDate', 'ClosedDate')
LIST_KEYS = ('Tags',)  # trailing comma: without it this is a plain string, not a tuple

def warning_nonexistant_key(key, value):
    print('Unknown key %s with value %s' % (key, value))
    return value

PREPROCESSOR = {
    INTEGER_KEYS: lambda k, v: int(v),
    STRING_KEYS: lambda k, v: v,
    DATE_KEYS: lambda k, v: dateutil.parser.parse(v + 'Z'),
    LIST_KEYS: lambda k, v: v[1:-1].split('><'),  # '<a><b>' -> ['a', 'b']
    '': warning_nonexistant_key,  # fallback for keys not listed above
}

if __name__ == '__main__':
    db = MongoClient('localhost', 27017)['so']
    for event, elem in etree.iterparse('/home/kokan/Posts.xml', events=('end',)):
        if elem.tag != 'row': continue
        db.entries.insert_one(dict([key, PREPROCESSOR[next((key_type for key_type in PREPROCESSOR if key in key_type), '')](key, value)] for key, value in elem.attrib.items()))
        elem.clear()

A brief explanation: I created a dictionary, PREPROCESSOR, whose keys are tuples of all the XML attribute names of a given type, and whose values are lambda functions that know how to cast a value from a string to that type. The key line is the db.entries.insert_one(...) call. For each XML attribute, it looks for the attribute name in each tuple that is a key of PREPROCESSOR, and if it finds it, it executes the matching preprocessor lambda. If it doesn’t find it, it prints the default warning and returns the unmodified value (as a string). There is so much in this line: generator expressions, dictionaries, tuples, lambdas and a couple of awesome and cool built-in functions. Unwrapped, it would look something like this:

entry = {}
for key, value in elem.attrib.items():
    found_key_type = ''
    for key_type in PREPROCESSOR.keys():
        if key in key_type:
            found_key_type = key_type
            break  # next() also stops at the first match
    cast_function = PREPROCESSOR[found_key_type]
    entry[key] = cast_function(key, value)

Don’t get me wrong, I would never write a line of code like that in production, nor would I encourage others to do so. But this was fun, it was a one-time-only script, and I wanted to push my (and Python’s) limits. And it turned out pretty cool, admit it :)
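For comparison, a more conventional way to get the same kind of typed conversion is to flatten the tuple-keyed dictionary into a plain attribute-to-function lookup once, up front. This is only a sketch under a few assumptions: the key lists are shortened, the names CASTERS_BY_TYPE, CASTERS and cast_row are made up, and the standard library's datetime.fromisoformat is swapped in for dateutil so the example runs without extra packages or a MongoDB server.

```python
from datetime import datetime, timezone

# Shortened key lists for illustration; the tuples in the post are longer.
CASTERS_BY_TYPE = {
    ("Id", "Score", "ViewCount"): int,
    ("Title", "Body"): str,
    # stand-in for dateutil.parser.parse(v + 'Z'): parse, then mark as UTC
    ("CreationDate",): lambda v: datetime.fromisoformat(v).replace(tzinfo=timezone.utc),
    ("Tags",): lambda v: v[1:-1].split("><"),
}

# Flatten once: attribute name -> cast function.
CASTERS = {key: cast for keys, cast in CASTERS_BY_TYPE.items() for key in keys}

def cast_row(attrib):
    """Return (typed copy of attrib, list of unknown keys).

    Unknown keys keep their string value, as in the original fallback.
    """
    typed, unknown = {}, []
    for key, value in attrib.items():
        cast = CASTERS.get(key)
        if cast is None:
            unknown.append(key)
            typed[key] = value
        else:
            typed[key] = cast(value)
    return typed, unknown

# A made-up row in the shape of the dump's attributes.
row = {"Id": "4", "CreationDate": "2008-07-31T21:42:52.667", "Tags": "<html><css>", "Foo": "bar"}
typed, unknown = cast_row(row)
```

The trade-off is a few more lines in exchange for one dictionary lookup per attribute, instead of a scan over every tuple.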

The whole source code, if you are interested, is here.

Posted in Uncategorized | 3 Comments

Programming languages playground

Preface to second edition

This text is around 4-5 years old. My old blog vanished, and this is one of the posts that were on it. I am putting it back because it got very good reviews. Heck, somebody even translated it into a language so obscure that even Google can’t translate it (try it for yourself, I think it is Burmese). Although it is that old, much of it still holds true, and when I read it today, I think that not much has changed.

If programming languages were kids

It’s summer. The day is sunny and all the kids have gone out to play. They have all gathered at the playground to enjoy the beautiful day, and we are now going to describe some of them.

The first kid that catches the eye is a tall boy, larger than all the other kids and obviously older than all of them. His name is C. He’s casually dressed and always smiling cheerfully. All the smaller kids are swirling around him, and he obviously enjoys playing with them. He knows he is the coolest kid there and that he has the respect of all of them, but he is not presumptuous about it. He’s fast, his moves are sharp and intelligent, and he likes to help the other kids, knowing that they are helpless without him. Look, he just helped that kid Python climb that tree. Python could climb the tree himself, but it would take him forever, so he asked C for help. C, smiling as always, immediately picked him up and put him on a branch. He really is like an older brother to all of them.

Speaking of brothers, C really does have a younger brother. His name is C++. Actually, they are stepbrothers. C’s fathers had thicker beards than C++’s father, and C++’s father has less hair. You can see that the two kids look alike, but C++ is a little smaller and younger. C++’s father thought he could give C a little brother who would be better at handling objects at the playground but would still look similar to C, and although he succeeded, the other kids still prefer to play with C. C++ is also a bit fat for a child of his age and a little slower than his older brother, but granted, he’s better at handling various objects. The reason for his slowness may be that he carries a lot of equipment with him. He has a little shovel, a rake and a little plastic bin, and man, he even has a Swiss knife. The other kids look in awe at C++’s tools and at what can be done with them, but they have also heard from their elders that those tools are just a burden, that it takes a skilled person to use them properly and wisely, and that you may even cut yourself if you try to use them without any training. That’s why they mostly like to play with his older brother.

We already mentioned the Python kid. He’s one of the kids that often ask C for help. He is fast and agile, but alas, there are times when only C can help. He’s a small kid with shaggy hair, and he enjoys doing things both nicely and quickly. He never stops, he’s restless, and his trousers are scraped at the knees from constant running, jumping and falling. In a word, he’s very dynamic. Because of his dynamic nature he often breaks things, but he can also put them back together very quickly, because he always carries duct tape with him. There is one more thing he always carries with him. If you ask him to do something, he’ll get to work immediately, no matter how hard it is, and when the job is finished, he likes to take a couple of batteries out of his pocket and shout childishly “batteries included!”.

Another kid similar to Python is Ruby. They are very much alike but, interestingly, they like to compete. If one of them does something, the other will try to do the same, only quicker and nicer, to show that he is better. They even dress alike, except that Ruby likes to wear red. When they have to do something in parallel and Python is faster than Ruby, he likes to say that Ruby’s red shirt is in fact woven from green threads. If, on the other hand, Ruby wins, he can jump around Python all day making fun of him by shouting “Global Interpreter Lock!”. Ruby’s dad is from Japan and he really likes his kid. He’s so protective, and sometimes worries so much about his child’s security, that the child can’t develop normally because of him. C is like an older brother to Ruby too, and helps him a lot when he’s stuck.

PHP is one of the weirdest kids out there. He’s around and he loves C, but he rarely hangs out with him or asks for help, mostly because he never needs it. He’s the smallest, but also one of the fastest, liveliest and most popular kids out there. Python and Ruby want to be as popular and fast as he is when they grow up. Also, PHP doesn’t respect anyone; he is kind of a rebel and has his own ways. One time, for example, the kids wanted to build a sand castle. They all gathered and started to discuss how to do it. They mentioned “frameworks”, “scalability”, “paradigms”, “design patterns” and all the other stuff kids talk about when they build sand castles, and suddenly, in the middle of the talk, they turned around and saw that PHP had already built his castle. He just said “Architecture, who needs that” and built it. It might not last long and you can’t build a new floor on top of it, but ironically, his castle was better and more stable than any of the other castles.

Of course, not all the kids like C and playing with him. There is this kid called Java. Although he depends on C, he thinks he’s better than him and doesn’t like to ask him for any help. Yes, he respects him, but he thinks he can do everything on his own. He doesn’t like to play with other children and is very introverted. This is because, when he was younger, he was extremely fat and slow, and the other kids always made fun of him. He has not been accepted since. All the kids remember the day when a long-bearded man dressed like a hippie, calling himself RMS, came to them and told them that they should avoid Java because he is not open to other kids, going on and on about the trap they would fall into if they hung out with him. This gave Java an inferiority complex. The inferiority complex soon developed into a superiority complex, and that explains a lot of his behavior. Since then, Java has tried hard to overcome his obesity problem, and although he’s slim now, the scars from bullying and the wrinkled skin are still visible. Even today he tries to be more open, but it seems to be all in vain. He doesn’t even want to hear about the other kids. He created his own tools, his own toys that are not compatible with the other kids’ toys, even his own part of the playground, which he calls open and accessible to everyone. He tries to lure other kids into joining him, but they know that once you enter his part of the playground, there is no going back. For lack of other kids’ company, he artificially created his own kids from his special DNA called JVM, and now plays with them.

There is one other kid who also thinks he’s too good to hang out with C. His name is C#. He is just an ordinary kid, but he thinks that he is somehow better than the other kids. He wears a corporate suit with a pink tie and always keeps his head high. He doesn’t speak with other children; to him they are all stupid and immature. He is always surrounded by his fathers, who also wear corporate suits and forbid him to play with other kids. He is very spoiled, because he’s very rich and his fathers buy him everything he wants. His suit is always clean, because he doesn’t really want to play very much. If, for example, he needs to climb a tree, he just calls one of his dads and orders him to buy a ladder. Like Java, he has all his tools built for him by his fathers, as people from the community rarely donate any tools to him. The other kids and their fathers despise him for his attitude and don’t want to have anything to do with him. The only ones who adore him very, very much are some other elders who also wear corporate suits with pink ties, because they like that he is always safe and secure with his fathers.

Oh, I almost forgot one other kid. His name is Visual Basic. Unfortunately, he is retarded. He just sits all day long by the sandbox, with his head low, drooling in sand and hitting himself in head with his hand. Poor kid.

Posted in Uncategorized | Leave a comment

Programming to the people

At work, as an exercise, we had to take some time and come up with a “vision”. This is the complete, unedited text of the vision I have always had and dreamt about; only now did I need to put it into words and present it to colleagues. I am now sharing it with the world.

My vision is simple.

My vision is that my boss gets fired.

Did I get your attention? I hope I did, because when you skimmed over this text, you probably thought “Oh my God, look how long this guy’s vision is, who is going to read all that”. OK, let me clarify this some more for you: I want myself to be fired as well; I want my colleagues to get fired; I don’t want programming to exist at all!

People to the programming

 

This may sound pretty radical to you, but that is intentional (and by the way, if it does, thank you). Joking aside, I want you to look at the current state of “computing”, or whatever you want to call it. Today you get a box we call a computer, and this box is driven by something we call an OS. On top of that, we have a bunch of programs we use to accomplish something. These programs are real little gems. Each and every one of them is designed to have some purpose. All of them are hand-crafted, nourished, polished and constantly taken great care of. All of them required human labor to create and to keep updated. A real calorie-burning, sweat-in-the-pants, hemorrhoids-in-the-butt type of labor. Never underestimate even the simplest of programs, such as Notepad, as each one of them is a little piece of art, where every line has been looked at thousands of times, every condition triple-checked, every bug opened and resolved three times before finally being closed! Is this sustainable? I think not. And that’s only one of the problems. The bigger problem, as I already said, is that all of those programs, no matter how big they are, have a static purpose: they let you solve one problem. No matter how complex they are (think of Excel), you are the one who needs to drive them, and you are the one who needs to be creative and thorough to solve your problem; programs are merely helpers. And to make things even worse, your problems are usually not solved by one program alone: you need to switch constantly between things called “windows” to get your work done.

script

To conclude: programs are stupid helpers, and you are the one doing all the work, not the programs. You… you, my friend, should be making choices, not solving simple problems! So, I say:

Programming to the people

 

Imagine now what it would be like for the machine to be more than just the “compute” in “computer”. Imagine AI built on those machines, so advanced that it is capable of thinking like a human, but at the same time powerful enough to complement us in what we are not good at: “computing”. I am aware that this idea is not new and that it is a cliché; almost every SF book or movie has AI embedded somewhere in it. But let’s forget for a moment Arthur Clarke’s HAL9000, Asimov’s Multivac, Skynet from Terminator and hundreds of others. They are all sweet and nice, but if you remove the characters and plots, or the evilness of the AIs, you are left with one idea, and the sole purpose of this vision is to emphasize it. Having AI means that your machine could be converted from a set of unconnected, isolated, primitive tools into a real assistant, one that understands the context of the problems you are trying to solve. Just imagine the freedom you would have if you could articulate your problem to the AI and let it solve it for you. The implications of this are enormous. First of all, people will have more time to focus on their businesses and on important decisions; the AI is the one that will do the boring stuff. Second of all, there will be no programming as a concept and no developers; in fact, everybody will be a developer in some way.

Mary from Southborough, England will not have to bug her shy geek friend to write her a program that renames her vacation pictures to a date pattern (she would not have sex with him anyway), nor will she need to search the internet for programs called “Super JPEG Renamer 3000” just to extract EXIF data from pictures and batch rename them; she will just explain to her computer what she needs to accomplish. Deepak from Bhopal, India, an avid writer and former developer, will not have to hire an IT company or fiddle with WordPress just to create his blog that explains why the new AI technology sucks; he will ask the computer to do all the boring stuff, and he will just pick a domain, theme and options and start writing immediately. Larry from Louisiana, USA, also known as “Fatty McFatFat” on RottenTomatoes, a 35-year-old who still lives in the basement of his parents’ house, always wanted to be a movie director; he doesn’t need a big studio or fancy programs to create CG effects, he will direct a whole movie from the comfort of his couch in the basement. Cristobal from Santiago, Chile runs a charitable organization that collects second-hand clothes, and he wants to know what type of clothes people need most and for which gender. He has all the data, but he doesn’t know how to query the database to get that information; the computer is there to figure out the existing schema for him and to obtain whatever information he requests from it.

ex-facebook

As you have probably noticed, I am giving examples as if this AI existed today. In reality (or at least in a theoretical reality), decades will pass before we have that AI, and a lot of crucial things will change. The possibilities I am presenting here are probably ridiculously elementary and unimaginative, but one constant remains: the computer should free us. Are we going in the right direction? Well, I think it depends on how you look at things. I feel we are more interested in optimizing the lovely little primitive tools we currently have than in investing in this approach. On the other hand, given the current state of our knowledge of physics, the materials we use and the state of software engineering, I don’t think we could do better at this point anyway; it will take time for conditions to evolve before we start going there. Anyhow, the child in me likes to think that this is the future and that this future is near… even if it means I would be part of a layoff as technologically redundant.

Power to the people, right on

 

Just as we managed to bring books from the monastery elite to the peasants using Gutenberg’s machine, just as we provided electricity to those who couldn’t tell the difference between AC and DC even if it hit them in the head, and just as we brought computers even to people who did not wear thick glasses and didn’t have tics, the goal for the centuries to come is to bring programming to the masses. This is the next step in really empowering people. I only hope that a day will come when human civilization will see and speak to a real Multivac and not just read about it. And as for reading: thank you for reading this!

Posted in Uncategorized | Leave a comment

Visualizing world’s births and deaths

All data presented here is available as a spreadsheet.

World population grows exponentially. Can we accurately graph the world’s births and deaths since the dawn of civilization? Of course not, but in this post I will try something that (I think) nobody has tried before. The idea is simple: every Wikipedia article about a person has categories of the form “YYYY births” and “YYYY deaths” (if that person is dead). We can use this metadata to create a graph depicting the births and deaths in each specific year. I used Pywikibot to fetch this data. Here is the source code:

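#!/usr/bin/env python
# Python 2 script written against the old "compat" (pywikipedia) flavor of Pywikibot
import wikipedia as pywikibot
import catlib

def main():
    site = pywikibot.getSite(code='en')
    for year in range(1, 2010):
        # namespace 14 is "Category"; swap the two lines below to count deaths instead of births
        #cat = catlib.Category(site, "%s:%s" % (site.namespace(14), "%d_deaths" %year))
        cat = catlib.Category(site, "%s:%s" % (site.namespace(14), "%d_births" %year))
        # one output line per year: "<year> - <number of articles in the category>"
        print "%d - %d" % (year, len(cat.articlesList()))

if __name__ == '__main__':
    main()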

And, as already said, here is the spreadsheet with all the data. Before presenting the results, a small disclaimer: this approach has several disadvantages. First of all, the results are collected from the English Wikipedia, and as huge as it is, it is still Western-centric, so these statistics are probably missing a lot of Chinese and other Eastern people. Secondly, a lot of people, especially those born in the distant past, don’t have an accurate year of birth or death. In a way, the results will reflect our knowledge of historical people’s lives more than world population growth, but some interesting things can be observed nonetheless. Thirdly, these are statistics for notable persons only, but I think they can scale to the whole population as well. Please feel free to comment on whatever you see I missed, or to further explain things on the graphs where I don’t have explanations. All the data presented here goes up to the year 2009.

Births

This is the graph of people’s births. The x-axis shows years and the y-axis the number of births (click to enlarge):

births

The graph above is not very useful, so we’ll construct a better one. The graph below is built by averaging births over 10-year ranges (to show general trends) and uses a logarithmic scale (due to the exponential nature of this kind of data). It reveals much more (click to enlarge):

births-log
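The smoothing behind the logarithmic graph can be sketched as follows. This is an illustration only: it assumes the averages are taken over consecutive, non-overlapping 10-year ranges (the post does not say whether a rolling window was used instead), the function names are made up, and the sample counts are invented rather than taken from the spreadsheet.

```python
import math

def decade_averages(counts_by_year, window=10):
    """Average yearly counts over consecutive `window`-year ranges.

    Returns a list of (first_year_of_range, average) pairs. Years that are
    missing from the input count as zero.
    """
    first, last = min(counts_by_year), max(counts_by_year)
    averages = []
    for start in range(first, last + 1, window):
        years = range(start, min(start + window, last + 1))
        total = sum(counts_by_year.get(year, 0) for year in years)
        averages.append((start, total / len(years)))
    return averages

def log_scale(averages):
    """Map each average to log10; ranges with an average of 0 are skipped."""
    return [(start, math.log10(avg)) for start, avg in averages if avg > 0]

# Invented counts: 10 births a year for years 1-10, 1000 a year for years 11-20.
sample = {year: 10 for year in range(1, 11)}
sample.update({year: 1000 for year in range(11, 21)})
averaged = decade_averages(sample)
logged = log_scale(averaged)
```

On a logarithmic axis, the hundredfold jump between the two ranges becomes a difference of 2, which is why slow early growth and the modern explosion can share one readable graph.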

Now that we have these graphs, let’s see what we can conclude from them. We’ll start chronologically:

  • (red logarithmic graph) There is a large rise in births around the year 170; almost all of them are Chinese military generals (you can see that here and here). I don’t know Chinese history, but they all belong to the Three Kingdoms period. The question now is: was this era a period of great development in China, or a period when nothing important happened in Europe?
  • (red logarithmic graph) Around 600-1100 the births graph shows no rise. Of course there should be one, because population growth was constant, so I think this could be explained by the Dark Ages.
  • (red logarithmic graph) You can see a rise in the first half of the 17th century that I can’t explain. Is it a large rise, or a large fall that happened afterwards? Could that large fall be due to some catastrophic event, like the plague in Europe at that time? Or is there a large, century-long continuous rise caused by some equally large, century-long catastrophe? Can a baby boomer generation happen on a large scale, spanning decades? Or maybe it is just the pinnacle of the Renaissance, so there were a lot of famous people at that time?
  • (blue, linear graph) You can see a very large rise in births after WWII (the exact year is 1947, as you can see in the spreadsheet). I think this is a very good visualization of the baby boomer generation.
  • (blue, linear graph) The highest peak in births is around 1979-1985 (the year with the most people born, according to Wikipedia, is 1982), so I guess that, looking at this, you can see the most probable age of one’s affirmation and recognition. If you’re older than this, your chances of becoming famous are getting slimmer every day :)

Deaths

Let’s take a look at the number of deaths over the years. Here’s a linear graph (click to enlarge):

deaths

And here’s the same graph, but averaged over 10-year periods and in logarithmic scale (click to enlarge):

deaths-log

Some observations I came up with:

  • (red logarithmic graph) The first thing you notice on this graph is a large peak around the year 304. The reason is Diocletian’s persecution of Christians, which happened that year; a large number of people who would later become Christian martyrs died. See for yourself.
  • (red logarithmic graph) Also interesting is a large drop in deaths around the year 440, for which I really don’t have an explanation.
  • (red logarithmic graph) As in the births graph, here we also see a large rise in deaths between the 16th and 17th centuries (or a large drop in deaths after that?), which also puzzles me.
  • (blue, linear graph) The sudden jump around 1914-1918 is pretty clear. Deaths of millions, the greatest tragedy ever, visualized as just a few pixels…
  • (blue, linear graph) The jump around 1938-1945 also shows how many great people died in WWII (along with millions of unknown ones, who don’t have their place in Wikipedia…)
  • (red logarithmic graph) In contrast to the births graph, which drops rapidly after 1985, the graph of deaths just continues to grow, logically.

Is there anything I missed that you can add to these observations? Can you explain the slopes in these graphs better than I can? Let’s squeeze all the facts we can from this data!

Posted in Analitika, Programiranje, Python | Tagged , , | 1 Comment

Allow access to whole S3 Bucket to IAM user

It took me a while to figure this out. Googling helped, but the answers are not obvious. So, you have an IAM user and you want to grant that user complete read-write access to some bucket. The catch is that you need two statements to achieve this. Here is the full bucket policy (just replace “YourIAMUser”, “YourBucketName” and the account ID in the ARN in the policy below):

{
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Principal": {
                "AWS": "arn:aws:iam::821707826313:user/YourIAMUser"
            },
            "Resource": [
                "arn:aws:s3:::YourBucketName"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Principal": {
                "AWS": "arn:aws:iam::821707826313:user/YourIAMUser"
            },
            "Resource": [
                "arn:aws:s3:::YourBucketName/*"
            ]
        }
    ]
}

So, the explanation: as I already mentioned, notice that we have two separate statements.

  • The first one allows the IAM user to list the contents of the bucket (the “s3:ListBucket” action), and the resource given here is just the plain ARN of the bucket (“arn:aws:s3:::YourBucketName”).
  • The second statement gives that IAM user permissions on the objects in the bucket (“s3:PutObject”, “s3:GetObject” and “s3:DeleteObject”), but the resource given here is the path to your bucket plus “/*” (“arn:aws:s3:::YourBucketName/*”). This is the key thing I was missing when trying to create the policy using the AWS policy tool.
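If you need the same policy for several users or buckets, the two statements above can be generated instead of edited by hand. This is a minimal sketch: the function name bucket_policy is made up, and the account ID and names in the usage lines are placeholders.

```python
import json

def bucket_policy(user_arn, bucket_name):
    """Build the two-statement bucket policy as a JSON string."""
    principal = {"AWS": user_arn}
    policy = {
        "Statement": [
            {   # bucket-level statement: listing targets the bucket ARN itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Principal": principal,
                "Resource": ["arn:aws:s3:::%s" % bucket_name],
            },
            {   # object-level statement: the resource is the bucket ARN plus "/*"
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Principal": principal,
                "Resource": ["arn:aws:s3:::%s/*" % bucket_name],
            },
        ]
    }
    return json.dumps(policy, indent=4)

# Placeholder account ID and names, for illustration only.
policy_json = bucket_policy("arn:aws:iam::123456789012:user/YourIAMUser", "YourBucketName")
print(policy_json)
```

The resulting string can be pasted into the bucket policy editor, or, if you use boto3, passed as the Policy argument of the S3 client's put_bucket_policy call.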

Hope this helps you!

Posted in Uncategorized | Leave a comment