Saturday, 23 June 2012

Extracting metadata using Apache Tika

I have been looking at using Apache Tika to pull metadata from photos and MP3 files. It ships as a single jar that can run as a command-line tool, a GUI or an HTTP server.

MP3


java -jar ~andy/tika-app-1.1.jar 12\ Time\ Of\ The\ Season.mp3
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="xmpDM:audioCompressor" content="MP3"/>
<meta name="xmpDM:releaseDate" content="1968"/>
<meta name="Content-Length" content="5632909"/>
<meta name="xmpDM:album" content="Odessey &amp; Oracle"/>
<meta name="xmpDM:artist" content="Zombies"/>
<meta name="Author" content="Zombies"/>
<meta name="xmpDM:genre" content=""/>
<meta name="xmpDM:logComment" content=""/>
<meta name="Content-Type" content="audio/mpeg"/>
<meta name="resourceName" content="12 Time Of The Season.mp3"/>
<title>Time Of The Season</title>
</head>
<body><h1>Time Of The Season</h1>
<p>Zombies</p>
<p>Odessey &amp; Oracle, track 12</p>
<p>1968</p>
</body></html>
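
Since the output is well-formed XHTML, the metadata is easy to pick up from a script. Here is a minimal Python sketch, assuming the output has been captured as a string. The function name tika_meta and the abbreviated SAMPLE are mine, for illustration only. Values are collected into lists because a name can appear more than once.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tika's XHTML output puts every element in this namespace.
XHTML = "{http://www.w3.org/1999/xhtml}"


def tika_meta(xhtml_text):
    """Return {name: [content, ...]} for every <meta> element in the output."""
    root = ET.fromstring(xhtml_text.encode("utf-8"))
    meta = defaultdict(list)
    for tag in root.iter(XHTML + "meta"):
        meta[tag.get("name")].append(tag.get("content"))
    return dict(meta)


# Abbreviated copy of the MP3 output above (hypothetical test input).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="xmpDM:album" content="Odessey &amp; Oracle"/>
<meta name="xmpDM:artist" content="Zombies"/>
<meta name="xmpDM:releaseDate" content="1968"/>
<title>Time Of The Season</title>
</head>
<body><h1>Time Of The Season</h1></body></html>"""

if __name__ == "__main__":
    for name, values in sorted(tika_meta(SAMPLE).items()):
        print(name, "=", "; ".join(values))
```

The XML parser decodes entities, so the album comes back with a plain ampersand.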

JPEG

java -jar ~andy/tika-app-1.1.jar skating.jpg
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Software" content="Picasa"/>
<meta name="GPS Altitude Ref" content="Sea level"/>
<meta name="subject" content="greenwich"/>
<meta name="subject" content="ice"/>
<meta name="subject" content="skate"/>
<meta name="Content-Length" content="96154"/>
<meta name="Exif Version" content="2.20"/>
<meta name="date" content="2012-06-23T20:10:18"/>
<meta name="Component 1" content="Y component: Quantization table 0, Sampling factors 2 horiz/2 vert"/>
<meta name="tiff:ImageLength" content="768"/>
<meta name="Component 2" content="Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert"/>
<meta name="GPS Latitude" content="51&quot;28'58.75122"/>
<meta name="Component 3" content="Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert"/>
<meta name="GPS Latitude Ref" content="N"/>
<meta name="description" content="solange, ann and oliver ice skating collage"/>
<meta name="tiff:ImageWidth" content="1024"/>
<meta name="Image Width" content="1024 pixels"/>
<meta name="resourceName" content="skating.jpg"/>
<meta name="Keywords" content="greenwich"/>
<meta name="Keywords" content="ice"/>
<meta name="Keywords" content="skate"/>
<meta name="GPS Longitude Ref" content="W"/>
<meta name="GPS Longitude" content="0&quot;0'25.3368"/>
<meta name="Caption/Abstract" content="solange, ann and oliver ice skating collage"/>
<meta name="tiff:Software" content="Picasa"/>
<meta name="Number of Components" content="3"/>
<meta name="Image Height" content="768 pixels"/>
<meta name="Data Precision" content="8 bits"/>
<meta name="tiff:BitsPerSample" content="8"/>
<meta name="geo:lat" content="51.48299"/>
<meta name="Last-Modified" content="2012-06-23T20:10:18"/>
<meta name="Unknown tag (0xa420)" content="45925b38462f517f73a2bab9760ae5ac"/>
<meta name="Date/Time" content="2012:06:23 20:10:18"/>
<meta name="Directory Version" content="4"/>
<meta name="geo:long" content="-0.00704"/>
<meta name="GPS Version ID" content="2 2 0 0"/>
<meta name="Content-Type" content="image/jpeg"/>
<title/>
</head>
Looking good.
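
The output reports the position twice: as degrees, minutes and seconds (GPS Latitude, GPS Longitude plus their Ref values) and as decimal degrees (geo:lat, geo:long). As a quick sanity check, the sketch below converts the first form to the second. The function name dms_to_decimal and the regular expression are my own, and the pattern only assumes the format seen in the output above.

```python
import re

# Matches values such as 51"28'58.75122 (degrees, minutes, seconds).
DMS = re.compile(r"""^(\d+)"(\d+)'(\d+(?:\.\d+)?)$""")


def dms_to_decimal(dms, ref):
    """Convert a degrees/minutes/seconds string and its N/S/E/W ref to decimal degrees."""
    match = DMS.match(dms)
    if match is None:
        raise ValueError("unrecognised coordinate: %r" % dms)
    degrees, minutes, seconds = (float(part) for part in match.groups())
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in decimal notation.
    return -value if ref in ("S", "W") else value


if __name__ == "__main__":
    print("%.5f" % dms_to_decimal("51\"28'58.75122", "N"))
    print("%.5f" % dms_to_decimal("0\"0'25.3368", "W"))
```

Rounded to five places, the two results agree with the geo:lat and geo:long values in the output (51.48299 and -0.00704).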

Monday, 11 June 2012

Upgrading to Dojo 1.7

Starting to upgrade the places website to Dojo 1.7 (migration-17), and trying to get to grips with the build system.

Issues

Web-based code editors

Looking for a nice web-based code editor component. I wanted XQuery and SPARQL in addition to the usual suspects. The choice seems to be between CodeMirror and Ace.

CodeMirror supports both SPARQL and XQuery out of the box.

Ace has very nice XQuery support, initially developed for the eXide feature of eXist by Wolfgang Meier and then enhanced, and pushed back to the main Ace project, by William Candillon. Ace currently has no official SPARQL mode, but I spotted that Callimachus has the code for a slightly earlier Ace release.

For me the deciding factor is that Ace is built with AMD support, whereas CodeMirror is not. There is a project to convert/wrap CodeMirror for use with Dojo, which means using AMD, but it looks like a lot of work.
Related projects to watch: Treehugger and Dojo Widget for Ace

Tuesday, 5 June 2012

BaseX on Debian

Current install strategy:
  • unzip to /usr/local/share/basex-x.y
  • symlink /usr/local/share/basex to basex-x.y
velvet:/usr/bin# 
ln -s /usr/local/share/basex/bin/basexclient
ln -s /usr/local/share/basex/bin/basexserver
ln -s /usr/local/share/basex/bin/basexhttp
ln -s /usr/local/share/basex/bin/xquery 
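
The same strategy could be scripted so that an upgrade only re-points one link. This is a rough Python sketch, not something I have run against a real install. The function name link_basex and its parameters are made up for illustration, and the default directories are simply the ones used above.

```python
import os

# Launch scripts linked in the commands above.
TOOLS = ["basexclient", "basexserver", "basexhttp", "xquery"]


def link_basex(version_dir, share="/usr/local/share", bindir="/usr/bin"):
    """Point share/basex at version_dir and link each launch script into bindir."""
    current = os.path.join(share, "basex")
    if os.path.islink(current):
        os.remove(current)  # drop the link to the previous version
    os.symlink(version_dir, current)  # relative link: basex -> version_dir
    for tool in TOOLS:
        link = os.path.join(bindir, tool)
        if not os.path.lexists(link):  # leave existing links alone
            os.symlink(os.path.join(current, "bin", tool), link)


# Example (as root), with the real unzipped directory name in place of basex-x.y:
# link_basex("basex-x.y")
```

Because the links in the bin directory go through the unversioned basex link, they do not need to change when a new version is unzipped.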
 
TODO init.d

Monday, 4 June 2012

MarkLogic 5

I picked up a MarkLogic USB drive with 5.0.2 at XML Prague 2012. I installed this on Ubuntu, but I also wanted to get it running on my ReadyNAS Pro 2. My first try, which involved attempting to install alien directly on the NAS, ended in a factory reset.

Older and wiser, I tried again using a chroot with a Debian Squeeze image. The source file from the USB drive was MarkLogic-5.0-2.i686.rpm. The instructions are based on
http://developer.marklogic.com/download#comment-354202122

To install

After doing debootstrap to create /c/home/squeeze
chroot /c/home/squeeze 

sudo apt-get install alien
sudo alien --to-deb --verbose MarkLogic-5.0-2.i686.rpm
sudo dpkg -i marklogic_5.0-3_i386.deb

To run

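# make /proc available inside the chroot
mount -t proc proc /c/home/squeeze/proc
# copy DNS and hosts settings so name resolution works inside the chroot
cp /etc/resolv.conf /c/home/squeeze/etc
cp /etc/hosts /c/home/squeeze/etc
chroot /c/home/squeeze
/etc/init.d/MarkLogic start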

The result 

 

Sunday, 3 June 2012

BaseX event handling

Trying to update BaseX-node to handle events.
It works the first time:

21:19:01.858 [127.0.0.1:38155] LOGIN admin OK
21:19:01.861 [127.0.0.1:38156] LOGIN test1 OK
21:19:02.003 [127.0.0.1:38155] CREATE EVENT messenger
21:19:02.005 [127.0.0.1:38155] OK 143.0 ms
21:19:02.027 [127.0.0.1:38156] OK 13.41 ms
21:19:02.247 [127.0.0.1:38156] QUERY(0) for $i in 1 to 1000000 where $i=3 return $i OK 219.51 ms
21:19:02.248 [127.0.0.1:38156] QUERY(0) OK 0.5 ms
21:19:02.252 [127.0.0.1:38155] QUERY(0) db:event('messenger', 'Hello World!') OK 230.45 ms
21:19:02.254 [127.0.0.1:38155] QUERY(0) OK 0.76 ms
21:19:02.584 [127.0.0.1:38155] EXEC(0) OK 329.22 ms
21:19:02.943 [127.0.0.1:38156] EXEC(0) OK 694.17 ms
21:19:02.945 [127.0.0.1:38156] OK 0.34 ms
21:19:03.004 [127.0.0.1:38155] DROP EVENT messenger
21:19:03.005 [127.0.0.1:38155] OK 2.18 ms
21:19:03.006 [127.0.0.1:38155] EXIT
21:19:03.011 [127.0.0.1:38155] OK 6.03 ms
21:19:03.012 [127.0.0.1:38155] LOGOUT admin OK
21:19:03.014 [127.0.0.1:38156] EXIT
21:19:03.019 [127.0.0.1:38156] OK 6.57 ms
21:19:03.020 [127.0.0.1:38156] LOGOUT test1 OK

Then it hangs:

21:19:16.106 [127.0.0.1:38158] LOGIN admin OK
21:19:16.112 [127.0.0.1:38158] CREATE EVENT messenger
21:19:16.115 [127.0.0.1:38158] OK 27.79 ms
21:19:16.151 [127.0.0.1:38159] LOGIN test1 OK
21:19:16.192 [127.0.0.1:38158] QUERY(0) db:event('messenger', 'Hello World!') OK 0.98 ms
21:19:16.200 [127.0.0.1:38158] QUERY(0) OK 7.4 ms
21:19:16.207 [127.0.0.1:38159] OK 0.65 ms
21:19:16.220 [127.0.0.1:38159] QUERY(0) for $i in 1 to 1000000 where $i=3 return $i OK 7.77 ms
21:19:16.231 [127.0.0.1:38159] QUERY(0) OK 4.3 ms
21:19:16.253 [127.0.0.1:38158] EXEC(0) Error: null
21:19:16.732 [127.0.0.1:38159] EXEC(0) OK 459.42 ms
21:19:20.036 [127.0.0.1:38158] LOGOUT admin OK
21:19:20.042 [127.0.0.1:38159] LOGOUT test1 OK
What is that error?
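
To compare the two runs it may help to split the server log by connection, since the admin and test1 sessions are interleaved. The Python sketch below does that, assuming only the line layout seen above (time, [address:port], message). The names by_client and errors are mine, and SAMPLE is a few lines copied from the second run.

```python
import re
from collections import defaultdict

# time, [address:port], then the rest of the line
LINE = re.compile(r"^(\d\d:\d\d:\d\d\.\d+) \[([^\]]+)\] (.*)$")


def by_client(log_text):
    """Group log messages by client address, keeping their original order."""
    sessions = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match:
            timestamp, client, message = match.groups()
            sessions[client].append((timestamp, message))
    return dict(sessions)


def errors(log_text):
    """Return (timestamp, client, message) for every message containing 'Error'."""
    return [(ts, client, msg)
            for client, entries in by_client(log_text).items()
            for ts, msg in entries if "Error" in msg]


# A few lines copied from the second run above.
SAMPLE = """\
21:19:16.106 [127.0.0.1:38158] LOGIN admin OK
21:19:16.112 [127.0.0.1:38158] CREATE EVENT messenger
21:19:16.151 [127.0.0.1:38159] LOGIN test1 OK
21:19:16.253 [127.0.0.1:38158] EXEC(0) Error: null
21:19:16.732 [127.0.0.1:38159] EXEC(0) OK 459.42 ms
"""

if __name__ == "__main__":
    for client, entries in by_client(SAMPLE).items():
        print(client)
        for timestamp, message in entries:
            print("   ", timestamp, message)
    print(errors(SAMPLE))
```

Reading each connection on its own should make it easier to see whether the order of operations differs between the run that works and the one that hangs.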