Fix stream parsing

This commit fixes bug 811. Bug 811 was caused because tokens were referring to a buffer that was reallocated and thus the pointers were not valid anymore. Background: The buffered input stream avoids copying the whole input stream before handing it to ANTLR (in contrast to the non-buffered input stream that first copies everything into a buffer). This enables interactivity (e.g. with kind2) and may save memory. CVC4 uses it when reading from stdin in competition mode for the application track (the incremental benchmarks) and in non-competition mode. To set the CVC4_SMTCOMP_APPLICATION_TRACK flag, the {C,CXX}FLAGS have to be modified at configure time. Solution: This commit fixes the issue by changing how a stream gets buffered. Instead of storing the stream into a single buffer, CVC4 now stores each line in a separate buffer, making sure that they do not have to move, keeping tokens valid. The commit adds the LineBuffer class for managing those buffers. It further modifies CVC4's LA and consume functions to use line number and position within a line to index into the line buffer. This allows us to use the standard mark()/etc. functions because they automatically store and restore that state. The solution also (arguably) simplifies the code. Disadvantages: Tokens split across lines would cause problems (seems reasonable to me). One allocation per line. Alternatives considered: Pull request 162 by Tim was a first attempt to solve the problem. The issues with this solution are: memory usage (old versions of the buffer do not get deleted), tokens split across buffers would be problematic, and mark()/rewind()/etc. would have to be overwritten for the approach to work. I had a partially working fix that used indexes into the stream instead of pointers to memory. The solution stored the content of the stream into a segmented buffer (lines were not guaranteed to be consecutive in memory. This approach was working for basic use cases but had the following issues: ugly casting (the solution requires casting the index to a pointer and storing it in the input stream's nextChar because that's where ANTLR is taking the location information from when creating a token), more modifications (not only would this solution require overwriting more functions of the input stream such as substr, it also requires changes to the use of GETCHARINDEX() in the Smt2 parser and AntlrInput::tokenText() for example), more complex code.
author: Andres Nötzli <andres.noetzli@gmail.com> 2017-06-16 17:42:22 -0700
committer: GitHub <noreply@github.com> 2017-06-16 17:42:22 -0700
commit: 0da095c3f5be79a85c085b4b46d7a0a0513ecdd6 (patch)
tree: b651858c19ac8afc6f8cebf69f958dfe97781b86 /src/parser/antlr_input.h
parent: b3e1f7ade822d061f276a9477f1ea67fcb1f3a50 (diff)
1 files changed, 6 insertions, 16 deletions
diff --git a/src/parser/antlr_input.h b/src/parser/antlr_input.h
index 8e5e82811..293be0087 100644
--- a/src/parser/antlr_input.h
+++ b/src/parser/antlr_input.h
@@ -35,12 +35,12 @@
 #include "base/output.h"
 #include "parser/bounded_token_buffer.h"
 #include "parser/input.h"
+#include "parser/line_buffer.h"
 #include "parser/parser_exception.h"
 #include "util/bitvector.h"
 #include "util/integer.h"
 #include "util/rational.h"
 
-
 namespace CVC4 {
 
 class Command;
@@ -62,10 +62,11 @@ private:
    */
   pANTLR3_UINT8 d_inputString;
 
-  AntlrInputStream(std::string name,
-                   pANTLR3_INPUT_STREAM input,
-                   bool fileIsTemporary,
-                   pANTLR3_UINT8 inputString);
+  LineBuffer* d_line_buffer;
+
+  AntlrInputStream(std::string name, pANTLR3_INPUT_STREAM input,
+                   bool fileIsTemporary, pANTLR3_UINT8 inputString,
+                   LineBuffer* line_buffer);
 
   /* This is private and unimplemented, because you should never use it. */
   AntlrInputStream(const AntlrInputStream& inputStream) CVC4_UNDEFINED;
@@ -201,9 +202,6 @@ public:
   /** Get a bitvector constant from the text of the number and the size token */
   static BitVector tokenToBitvector(pANTLR3_COMMON_TOKEN number, pANTLR3_COMMON_TOKEN size);
 
-  /** Retrieve the remaining text in this input. */
-  std::string getUnparsedText();
-
   /** Get the ANTLR3 lexer for this input. */
   pANTLR3_LEXER getAntlr3Lexer() { return d_lexer; }
 
@@ -243,14 +241,6 @@ protected:
   virtual void setParser(Parser& parser);
 };/* class AntlrInput */
 
-inline std::string AntlrInput::getUnparsedText() {
-  const char *base = (const char *)d_antlr3InputStream->data;
-  const char *cur = (const char *)d_antlr3InputStream->nextChar;
-
-  return std::string(cur, d_antlr3InputStream->sizeBuf - (cur - base));
-}
-
-
 inline std::string AntlrInput::tokenText(pANTLR3_COMMON_TOKEN token) {
   if( token->type == ANTLR3_TOKEN_EOF ) {
     return "<<EOF>>";
author	Andres Nötzli <andres.noetzli@gmail.com>	2017-06-16 17:42:22 -0700
committer	GitHub <noreply@github.com>	2017-06-16 17:42:22 -0700
commit	0da095c3f5be79a85c085b4b46d7a0a0513ecdd6 (patch)
tree	b651858c19ac8afc6f8cebf69f958dfe97781b86 /src/parser/antlr_input.h
parent	b3e1f7ade822d061f276a9477f1ea67fcb1f3a50 (diff)