Parsing PDFs
Strategies (with R)
Cyrille Médard de Chardon
Problem: 12,200+ pages (over 8 years of data)
Source: Administration de la Navigation Aérienne
One solution (using R)
library (pdftools)
all_pdf <- pdf_data ('imgs/2020-byhour_page1.pdf' )
print (all_pdf[[1 ]])
width height x y space text
1 79 14 313 109 TRUE Luxembourg
2 43 14 397 109 FALSE Airport
3 30 12 307 125 TRUE Noise
4 63 12 341 125 TRUE Distribution
5 13 12 408 125 TRUE by
6 26 12 424 125 FALSE Hour
7 36 10 234 145 TRUE Between
8 47 10 274 145 TRUE 01/01/2020
9 36 10 323 145 TRUE 00:00:00
10 15 10 362 145 TRUE and
11 47 10 381 145 TRUE 01/01/2020
12 36 10 430 145 TRUE 23:59:59
13 25 10 470 145 TRUE (Local
14 24 10 498 145 FALSE Time)
15 14 8 138 464 FALSE 57.1
16 14 8 193 464 FALSE 56.2
17 14 8 243 464 FALSE 50.0
18 14 8 291 464 FALSE 52.2
19 14 8 337 464 FALSE 59.0
20 14 8 383 464 FALSE 59.4
21 19 9 356 548 TRUE Page
22 4 9 378 548 TRUE 1
23 7 9 385 548 TRUE of
24 4 9 395 548 FALSE 4
25 4 9 701 282 FALSE 0600
26 4 9 691 282 FALSE 0500
27 4 9 680 282 FALSE 0400
28 4 9 670 282 FALSE 0300
29 4 9 659 282 FALSE 0200
30 4 9 648 282 FALSE 0100
31 4 9 637 282 FALSE 0000
32 4 9 627 282 FALSE 2300
33 4 9 616 282 FALSE 2200
34 4 9 605 282 FALSE 2100
35 9 4 576 293 FALSE Hour
36 9 4 490 304 TRUE Total
37 11 4 500 304 FALSE LDEN
38 22 4 540 304 TRUE Background
39 11 4 564 304 FALSE LDEN
40 13 4 591 304 TRUE Aircraft
41 11 4 605 304 FALSE LDEN
42 13 4 641 304 TRUE Aircraft
43 10 4 656 304 TRUE Noise
44 12 4 668 304 FALSE Events
45 7 4 445 317 FALSE 100
46 4 4 448 333 FALSE 90
47 4 4 448 349 FALSE 80
48 4 4 448 365 FALSE 70
49 4 4 448 381 FALSE 60
50 9 4 580 428 FALSE Hour
51 11 5 552 446 TRUE Total
52 22 5 566 446 FALSE LEvening
53 11 5 600 446 TRUE Total
54 15 5 614 446 FALSE LNight
55 4 9 705 418 FALSE 0600
56 4 9 694 418 FALSE 0500
57 4 9 684 418 FALSE 0400
58 4 9 673 418 FALSE 0300
59 4 9 663 418 FALSE 0200
60 4 9 651 418 FALSE 0100
61 4 9 641 418 FALSE 0000
62 4 9 630 418 FALSE 2300
63 4 9 620 418 FALSE 2200
64 4 9 609 418 FALSE 2100
65 4 9 598 418 FALSE 2000
66 4 9 587 418 FALSE 1900
67 4 9 577 418 FALSE 1800
68 4 9 566 418 FALSE 1700
69 4 9 555 418 FALSE 1600
70 4 9 544 418 FALSE 1500
71 4 9 533 418 FALSE 1400
72 4 9 523 418 FALSE 1300
73 4 9 512 418 FALSE 1200
74 4 4 448 413 FALSE 40
75 4 9 501 418 FALSE 1100
76 4 4 448 397 FALSE 50
77 11 5 504 446 TRUE Total
78 13 5 517 446 FALSE LDAY
79 8 8 102 464 FALSE 61
80 4 9 595 282 FALSE 2000
81 2 4 713 277 FALSE 0
82 4 9 584 282 FALSE 1900
83 4 4 446 277 FALSE 40
84 4 9 574 282 FALSE 1800
85 2 4 713 261 FALSE 2
86 4 9 563 282 FALSE 1700
87 4 4 446 261 FALSE 50
88 4 9 553 282 FALSE 1600
89 2 4 713 243 FALSE 4
90 4 9 542 282 FALSE 1500
91 4 4 446 243 FALSE 60
92 4 9 531 282 FALSE 1400
93 2 4 713 227 FALSE 6
94 4 9 521 282 FALSE 1300
95 4 4 446 227 FALSE 70
96 4 9 509 282 FALSE 1200
97 2 4 713 210 FALSE 8
98 4 9 499 282 FALSE 1100
99 4 4 446 210 FALSE 80
100 4 9 488 282 FALSE 1000
101 4 4 713 193 FALSE 10
102 4 9 478 282 FALSE 0900
103 4 4 446 193 FALSE 90
104 4 9 490 418 FALSE 1000
105 2 8 388 191 FALSE -
106 2 8 388 201 FALSE -
107 2 8 388 212 FALSE -
108 2 8 388 222 FALSE -
109 2 8 388 232 FALSE -
110 2 8 388 242 FALSE -
111 2 8 388 252 FALSE -
112 2 8 388 263 FALSE -
113 2 8 388 273 FALSE -
114 2 8 388 283 FALSE -
115 2 8 388 293 FALSE -
116 2 8 388 303 FALSE -
117 2 8 388 314 FALSE -
118 2 8 388 324 FALSE -
119 2 8 388 334 FALSE -
120 2 8 388 344 FALSE -
121 14 8 383 354 FALSE 53.2
122 14 8 383 365 FALSE 51.8
123 14 8 383 375 FALSE 52.2
124 14 8 383 385 FALSE 53.3
125 14 8 383 395 FALSE 49.9
126 14 8 383 405 FALSE 50.7
127 14 8 383 416 FALSE 51.6
128 14 8 383 426 FALSE 67.7
129 4 9 480 418 FALSE 0900
130 2 8 341 192 FALSE -
131 2 8 341 202 FALSE -
132 2 8 341 212 FALSE -
133 2 8 341 222 FALSE -
134 2 8 341 232 FALSE -
135 2 8 341 243 FALSE -
136 2 8 341 253 FALSE -
137 2 8 341 263 FALSE -
138 2 8 341 273 FALSE -
139 2 8 341 283 FALSE -
140 2 8 341 294 FALSE -
141 2 8 341 304 FALSE -
142 14 8 336 314 FALSE 55.1
143 14 8 336 324 FALSE 61.9
144 14 8 336 334 FALSE 55.7
145 14 8 336 345 FALSE 59.8
146 2 8 341 355 FALSE -
147 2 8 341 365 FALSE -
148 2 8 341 375 FALSE -
149 2 8 341 385 FALSE -
150 2 8 341 396 FALSE -
151 2 8 341 406 FALSE -
152 2 8 341 416 FALSE -
153 2 8 341 426 FALSE -
154 4 9 467 282 FALSE 0800
155 14 8 292 191 FALSE 39.2
156 14 8 292 201 FALSE 57.1
157 14 8 292 211 FALSE 50.8
158 14 8 292 221 FALSE 54.3
159 14 8 292 232 FALSE 51.9
160 14 8 292 242 FALSE 51.0
161 14 8 292 252 FALSE 42.9
162 14 8 292 262 FALSE 50.9
163 14 8 292 273 FALSE 51.9
164 14 8 292 283 FALSE 53.8
165 14 8 292 293 FALSE 50.0
166 14 8 292 303 FALSE 52.0
167 2 8 297 313 FALSE -
168 2 8 297 324 FALSE -
169 2 8 297 334 FALSE -
170 2 8 297 344 FALSE -
171 2 8 297 354 FALSE -
172 2 8 297 364 FALSE -
173 2 8 297 375 FALSE -
174 2 8 297 385 FALSE -
175 2 8 297 395 FALSE -
176 2 8 297 405 FALSE -
177 2 8 297 415 FALSE -
178 2 8 297 426 FALSE -
179 4 9 457 282 FALSE 0700
180 14 8 243 191 FALSE 39.2
181 14 8 243 201 FALSE 41.3
182 14 8 243 211 FALSE 44.9
183 14 8 243 221 FALSE 45.7
184 14 8 243 232 FALSE 45.6
185 14 8 243 242 FALSE 43.6
186 14 8 243 252 FALSE 42.9
187 14 8 243 262 FALSE 45.9
188 14 8 243 273 FALSE 45.7
189 14 8 243 283 FALSE 48.2
190 14 8 243 293 FALSE 44.4
191 14 8 243 303 FALSE 45.4
192 14 8 243 313 FALSE 49.9
193 14 8 243 324 FALSE 51.3
194 14 8 243 334 FALSE 51.2
195 14 8 243 344 FALSE 53.5
196 14 8 243 354 FALSE 53.2
197 14 8 243 364 FALSE 51.9
198 14 8 243 375 FALSE 52.3
199 14 8 243 385 FALSE 53.4
200 14 8 243 395 FALSE 49.9
201 14 8 243 405 FALSE 50.7
202 14 8 243 415 FALSE 51.6
203 14 8 243 426 FALSE 54.8
204 4 9 469 418 FALSE 0800
205 2 8 194 191 FALSE -
206 14 8 189 201 FALSE 56.9
207 14 8 189 211 FALSE 49.6
208 14 8 189 221 FALSE 53.7
209 14 8 189 232 FALSE 50.8
210 14 8 189 242 FALSE 50.1
211 2 8 194 252 FALSE -
212 14 8 189 262 FALSE 49.3
213 14 8 189 273 FALSE 50.8
214 14 8 189 283 FALSE 52.6
215 14 8 189 293 FALSE 48.7
216 14 8 189 303 FALSE 51.0
217 14 8 189 313 FALSE 53.4
218 14 8 189 324 FALSE 61.6
219 14 8 189 334 FALSE 53.9
220 14 8 189 344 FALSE 58.8
221 2 8 194 354 FALSE -
222 2 8 194 364 FALSE -
223 2 8 194 375 FALSE -
224 2 8 194 385 FALSE -
225 2 8 194 395 FALSE -
226 2 8 194 405 FALSE -
227 2 8 194 415 FALSE -
228 14 8 189 426 FALSE 67.5
229 4 4 713 176 FALSE 12
230 4 9 459 418 FALSE 0700
231 14 8 138 191 FALSE 39.2
232 14 8 138 201 FALSE 57.1
233 14 8 138 211 FALSE 50.8
234 14 8 138 221 FALSE 54.3
235 14 8 138 232 FALSE 51.9
236 14 8 138 242 FALSE 51.0
237 14 8 138 252 FALSE 42.9
238 14 8 138 262 FALSE 50.9
239 14 8 138 273 FALSE 51.9
240 14 8 138 283 FALSE 53.8
241 14 8 138 293 FALSE 50.0
242 14 8 138 303 FALSE 52.0
243 14 8 138 313 FALSE 55.1
244 14 8 138 324 FALSE 61.9
245 14 8 138 334 FALSE 55.7
246 14 8 138 344 FALSE 59.8
247 14 8 138 354 FALSE 53.2
248 14 8 138 364 FALSE 51.8
249 14 8 138 375 FALSE 52.2
250 14 8 138 385 FALSE 53.3
251 14 8 138 395 FALSE 49.9
252 14 8 138 405 FALSE 50.7
253 14 8 138 415 FALSE 51.6
254 14 8 138 426 FALSE 67.7
255 4 10 435 236 TRUE Noise
256 4 10 435 224 TRUE Level
257 4 10 435 212 FALSE (dBA)
258 2 8 105 191 FALSE -
259 4 8 104 201 FALSE 3
260 4 8 104 211 FALSE 3
261 4 8 104 221 FALSE 3
262 4 8 104 232 FALSE 3
263 4 8 104 242 FALSE 1
264 2 8 105 252 FALSE -
265 4 8 104 262 FALSE 3
266 4 8 104 273 FALSE 3
267 4 8 104 283 FALSE 8
268 4 8 104 293 FALSE 3
269 4 8 104 303 FALSE 5
270 4 8 104 313 FALSE 3
271 4 8 104 324 FALSE 5
272 4 8 104 334 FALSE 4
273 8 8 102 344 FALSE 12
274 2 8 105 354 FALSE -
275 2 8 105 364 FALSE -
276 2 8 105 375 FALSE -
277 2 8 105 385 FALSE -
278 2 8 105 395 FALSE -
279 2 8 105 405 FALSE -
280 2 8 105 415 FALSE -
281 4 8 104 426 FALSE 2
282 42 10 488 160 TRUE Location:
283 44 10 535 160 TRUE Roodt/Syr
284 20 10 589 160 TRUE NMT
285 5 10 612 160 TRUE #
286 5 10 620 160 FALSE 1
287 7 4 443 176 FALSE 100
288 4 5 721 212 TRUE AC
289 4 10 721 220 TRUE Noise
290 4 12 721 232 FALSE Events
291 29 8 36 191 TRUE 01/01/20
292 14 8 72 191 FALSE 7:00
293 29 8 36 201 TRUE 01/01/20
294 14 8 72 201 FALSE 8:00
295 29 8 36 211 TRUE 01/01/20
296 14 8 72 211 FALSE 9:00
297 29 8 36 221 TRUE 01/01/20
298 18 8 69 221 FALSE 10:00
299 29 8 36 232 TRUE 01/01/20
300 18 8 69 232 FALSE 11:00
301 29 8 36 242 TRUE 01/01/20
302 18 8 69 242 FALSE 12:00
303 29 8 36 252 TRUE 01/01/20
304 18 8 72 252 FALSE 13:00
305 29 8 36 262 TRUE 01/01/20
306 18 8 72 262 FALSE 14:00
307 29 8 36 273 TRUE 01/01/20
308 18 8 72 273 FALSE 15:00
309 29 8 36 283 TRUE 01/01/20
310 18 8 72 283 FALSE 16:00
311 29 8 36 293 TRUE 01/01/20
312 18 8 72 293 FALSE 17:00
313 29 8 36 303 TRUE 01/01/20
314 18 8 72 303 FALSE 18:00
315 29 8 36 313 TRUE 01/01/20
316 18 8 72 313 FALSE 19:00
317 29 8 36 324 TRUE 01/01/20
318 18 8 72 324 FALSE 20:00
319 29 8 36 334 TRUE 01/01/20
320 18 8 72 334 FALSE 21:00
321 29 8 36 344 TRUE 01/01/20
322 18 8 69 344 FALSE 22:00
323 29 8 36 354 TRUE 01/01/20
324 18 8 69 354 FALSE 23:00
325 29 8 36 364 TRUE 02/01/20
326 14 8 69 364 FALSE 0:00
327 29 8 36 375 TRUE 02/01/20
328 14 8 72 375 FALSE 1:00
329 29 8 36 385 TRUE 02/01/20
330 14 8 72 385 FALSE 2:00
331 29 8 36 395 TRUE 02/01/20
332 14 8 72 395 FALSE 3:00
333 29 8 36 405 TRUE 02/01/20
334 14 8 72 405 FALSE 4:00
335 29 8 36 415 TRUE 02/01/20
336 14 8 72 415 FALSE 5:00
337 29 8 36 426 TRUE 02/01/20
338 14 8 72 426 FALSE 6:00
339 30 9 91 157 FALSE Aircraft
340 20 9 380 158 FALSE Total
341 20 9 137 158 FALSE Total
342 49 9 227 158 TRUE Background
343 20 9 279 158 TRUE Total
344 20 9 301 158 TRUE LDay
345 20 9 334 158 FALSE Total
346 30 9 181 158 FALSE Aircraft
347 27 9 93 167 TRUE Events
348 23 9 128 167 TRUE LDEN
349 11 9 154 167 TRUE dB
350 23 9 172 168 TRUE LDEN
351 23 9 198 168 TRUE dB(A)
352 23 9 227 168 TRUE LDEN
353 23 9 253 168 FALSE dB(A)
354 38 9 325 168 TRUE LEvening
355 26 9 370 167 TRUE LNight
356 11 9 399 167 FALSE dB
357 23 9 289 168 FALSE dB(A)
358 11 9 384 177 FALSE (A)
359 11 9 141 177 FALSE (A)
360 23 9 333 177 FALSE dB(A)
361 4 11 437 375 TRUE Noise
362 4 10 437 363 TRUE Level
363 4 11 437 350 FALSE (dBA)
364 15 9 40 157 FALSE Day
How is the data structured?
The origin here is in the top left. PDFs have nested boxes that can be rotated - origin can be elsewhere.
A new problem?
We now have a list of individual words with their Cartesian coordinates.
It’s going to be possible but some work to parse into clean data.
A good strategy would be to break the space into areas of interest, and omit the ‘garbage’.
Attack strategy ‘Fruit Ninja’
Need to determine target bounds!
If only there was a magical program to help us understand the PDF coordinate system.
Inkscape!
The solution, as in most cases, is using Inkscape.
Here to find the the ‘box’ of data that interests us.
By setting the ruler (right-click) or coordinate units to pt (points) you see the same units reported in R.
It’s not perfect
PDFs use points, which are defined as 1/72 of an inch.
Below shows the offset between the expected location coordinates and the actual.
With a little tweaking, it works
There’s another problem
Text on a line - is not on the exact same y coordinate.
It’s often off by 1 or 2.
width height x y space text
305 29 8 36 262 TRUE 01/01/20
306 18 8 72 262 FALSE 14:00
265 4 8 104 262 FALSE 3
238 14 8 138 262 FALSE 50.9
212 14 8 189 262 FALSE 49.3
187 14 8 243 262 FALSE 45.9
162 14 8 292 262 FALSE 50.9
137 2 8 341 263 FALSE -
112 2 8 388 263 FALSE -
We can cluster the y-coordinate values, but it’s helpful if we know the number of rows.
Cluster into rows
Results
Table of value frequencies (abridged), before classification/clustering:
191 192 201 202 211 212 221 222 232 242 243 252 253 262 263 273 283
8 1 8 1 7 2 7 2 9 8 1 8 1 7 2 9 9
After (abridged):
191 201 211 221 232 242 252 262 273 283 293 303 313 324 334 344 354
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
The rows (y coordinates) have been harmonized.
We now have cleaned data.
date hour ac_events total_LDEN_dB ac_LDEN_dB bg_LDEN_dB
191 01/01/20 7:00 - 39.2 - 39.2
201 01/01/20 8:00 3 57.1 56.9 41.3
211 01/01/20 9:00 3 50.8 49.6 44.9
221 01/01/20 10:00 3 54.3 53.7 45.7
232 01/01/20 11:00 3 51.9 50.8 45.6
242 01/01/20 12:00 1 51.0 50.1 43.6
252 01/01/20 13:00 - 42.9 - 42.9
262 01/01/20 14:00 3 50.9 49.3 45.9
273 01/01/20 15:00 3 51.9 50.8 45.7
283 01/01/20 16:00 8 53.8 52.6 48.2
293 01/01/20 17:00 3 50.0 48.7 44.4
303 01/01/20 18:00 5 52.0 51.0 45.4
313 01/01/20 19:00 3 55.1 53.4 49.9
324 01/01/20 20:00 5 61.9 61.6 51.3
334 01/01/20 21:00 4 55.7 53.9 51.2
344 01/01/20 22:00 12 59.8 58.8 53.5
354 01/01/20 23:00 - 53.2 - 53.2
364 02/01/20 0:00 - 51.8 - 51.9
375 02/01/20 1:00 - 52.2 - 52.3
385 02/01/20 2:00 - 53.3 - 53.4
395 02/01/20 3:00 - 49.9 - 49.9
405 02/01/20 4:00 - 50.7 - 50.7
415 02/01/20 5:00 - 51.6 - 51.6
426 02/01/20 6:00 2 67.7 67.5 54.8
The data
“Noise above 70 dB over a prolonged period of time may start to damage your hearing. Loud noise above 120 dB can cause immediate harm to your ears.”
- CDC
Relationship between events and noise (2020 data)
I’m not sure where the locations are.
Conclusion
Parsing PDFS
Easy to start / getting some data
Extracting boxes works well
Aligning text into rows can be challenging
Data munging is heavy
Overall a nice visual exercise that is methodologically interesting
Next steps
Process the other years of data (2012-2020) ✅
Upload cleaned CSV to data.public.lu ✅
Georeference the sensor locations
Analyze 2012-2020 data